faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

Adding SRR outgroups to the Phyluce pipeline #342

Open bbessesen opened 1 week ago

bbessesen commented 1 week ago

Having successfully completed the Phyluce process with my raw data, I'm now trying to rerun the program with outgroup data downloaded from SRA. The new SRR sequences are raw but paired (it's unclear whether they're trimmed). I tried inserting them at four places, but all attempts failed. First, I added them at the match counting step per readthedocs, but because there was no sqlite file to point to, that didn't work. So, I tried adding them a step earlier at the probe matching step, which did run but generated an empty sqlite file and an empty match-count folder. I then tried inserting them just before the assembly; however, without the illumiprocessor step to create the split-adapter-quality-trimmed fastq.gz files, there was nothing for the assembly to point to. I finally tried to just add them from the very beginning but realized I don't have the tag sequences for the illum.conf file (and does it even make sense to run them through illumiprocessor if they're already paired?). Apologies for my lack of skills! What is the appropriate process?

brantfaircloth commented 1 week ago

They need to be input at or just after the read trimming process and then assembled using phyluce_assembly_assemblo_spades. SRA reads typically include adapters, so you usually want to trim them as well. How you do that is up to you (you don't have to use Illumiprocessor, but you do need to make the directory structure largely match what phyluce expects).
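As a rough illustration of the directory structure mentioned above, here is a minimal sketch that copies one pair of already-trimmed FASTQ files into the per-sample layout that illumiprocessor normally writes (`<sample>/split-adapter-quality-trimmed/<sample>-READ1.fastq.gz` and `-READ2.fastq.gz`). The function name `stage_sample` and the `clean-fastq` default are hypothetical choices for this example, not part of phyluce. illumiprocessor also writes a `-READ-singleton.fastq.gz` file, which this sketch does not create; whether your phyluce version needs one is worth checking.

```shell
# stage_sample NAME R1 R2 [OUTDIR]
# Hypothetical helper: place already-trimmed paired reads where phyluce
# looks for illumiprocessor output. Inputs are plain (uncompressed) FASTQ.
stage_sample() {
    local name="$1" r1="$2" r2="$3" outdir="${4:-clean-fastq}"
    local dest="${outdir}/${name}/split-adapter-quality-trimmed"

    mkdir -p "${dest}"

    # gzip -c writes to stdout, so the trimmed originals stay in place
    gzip -c "${r1}" > "${dest}/${name}-READ1.fastq.gz"
    gzip -c "${r2}" > "${dest}/${name}-READ2.fastq.gz"
}
```

For example, `stage_sample Outgroup_SRR3493972 SRR3493972_1.out.fastq SRR3493972_2.out.fastq` would create `clean-fastq/Outgroup_SRR3493972/split-adapter-quality-trimmed/` containing the two gzipped read files (the sample name here is made up).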

If you don't use illumiprocessor, you can just trim the samples manually and then format the directories like phyluce expects. The adapter sequences you use can basically just be the outer parts of the adapter - e.g., if these are dual-indexed TruSeq libraries, then the adapter.fa file for Trimmomatic looks like:

>adap1
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>adap2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

You could also trim with cutadapt using something like:

cutadapt -j <# CPU cores> -m 24 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-o SRR3493972_1.out.fastq -p SRR3493972_2.out.fastq \
../SRR3493972_1.fastq ../SRR3493972_2.fastq
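Once trimmed reads sit in per-sample `split-adapter-quality-trimmed` directories, the assembly step needs a config file with a `[samples]` section mapping each sample name to that directory. The following is a minimal sketch of generating such a file; the function name `write_assembly_conf` and the file names are hypothetical, and the layout is assumed to follow the one illumiprocessor produces.

```shell
# write_assembly_conf CLEAN_DIR CONF
# Hypothetical helper: list every sample directory under CLEAN_DIR that has
# a split-adapter-quality-trimmed subdirectory, as "name:/absolute/path/".
write_assembly_conf() {
    local clean_dir conf="$2" d name

    # resolve to an absolute path so the config works from any directory
    clean_dir="$(cd "$1" && pwd)"

    echo "[samples]" > "${conf}"
    for d in "${clean_dir}"/*/; do
        # skip anything that is not a trimmed-read sample directory
        [ -d "${d}split-adapter-quality-trimmed" ] || continue
        name="$(basename "${d}")"
        echo "${name}:${d}split-adapter-quality-trimmed/" >> "${conf}"
    done
}
```

After `write_assembly_conf clean-fastq assembly.conf`, the resulting file would be passed to the assembler, along the lines of `phyluce_assembly_assemblo_spades --conf assembly.conf --output spades-assemblies --cores 12` (check the options against the documentation for your installed version).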

bbessesen commented 1 week ago

Ideally, I would like to put them in from the beginning: starting with counting reads, then running illumiprocessor. Is there a way to do that if they're already paired? If yes, what does the illum.conf look like?

Here's the basic structure of the illum.conf for the original raw reads:

[adapters]
i7:GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCTCGTATGCCGTCTTCTGCTTG
i5:AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTGTGTAGATCTCGGTGGTCGCCGTATCATT

[tag sequences]
i7-L0063:CGCTCATT
i5-L0063:ATACACTT
i7-L0064:ACGTCCTG
i5-L0064:TTCCATTG

[tag map]
RS04B_L0063:i7-L0063,i5-L0063
RS04B_L0064:i7-L0064,i5-L0064

[names]
RS04B_L0063:Hydrophis_p_ssp_BB1_USNM192279
RS04B_L0064:Hydrophis_p_ssp_BB2_AMNH106682

brantfaircloth commented 1 week ago

I'm not really sure what you mean by "already paired". Do you mean the reads are interleaved?

With fastq-dump and fasterq-dump, you can convert reads from the SRA format to normal R1 and R2 files that are not interleaved (if that's what you mean).
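If it is unclear whether a downloaded file is interleaved, one quick heuristic is to compare the read IDs of its first two records: in an interleaved file they should match. The sketch below is only that heuristic, not a validator; the function name `looks_interleaved` is hypothetical, and it assumes an uncompressed FASTQ with exactly four lines per record.

```shell
# looks_interleaved FASTQ
# Hypothetical helper: exit 0 if the first two records share a read ID
# (suggesting R1/R2 are interleaved), exit 1 otherwise.
looks_interleaved() {
    awk 'NR == 1 || NR == 5 {
             id = $1                  # keep only the ID, drop the description
             sub(/\/[12]$/, "", id)   # drop a trailing /1 or /2 mate suffix
             ids[NR] = id
         }
         NR == 5 { exit }             # two headers are enough; jump to END
         END { exit !((5 in ids) && (ids[1] == ids[5])) }' "$1"
}
```

A check such as `looks_interleaved SRR3493972.fastq && echo "probably interleaved"` would then flag files that need splitting. Downloading with something like `fasterq-dump --split-files SRR3493972` should instead give separate `SRR3493972_1.fastq` and `SRR3493972_2.fastq` files, matching the input names used in the cutadapt command earlier in the thread.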