Adding SRR outgroups to the Phyluce pipeline

bbessesen commented 5 months ago

Having successfully completed the Phyluce process with my raw data, I'm now trying to rerun the program with outgroup data downloaded from SRA. The new SRR sequences are raw but paired (it's unclear whether they're trimmed). I tried inserting them at four places, but all attempts failed. First, I added them at the match counting step per readthedocs, but because there was no sqline to point to, that didn’t work. So, I tried added them a step earlier at the probe matching step, which did run but generated an empty sqline and an empty match-count folder. I then tried inserting them just before the assembly; however, without the illumiprocessor step to create the splits-adapter-quality-trimmed fasta.gz files, there was nothing for the assembly to point to. I finally tried to just add them from the very beginning but realized I don’t have the tag sequences for the illum.conf file (and does it even make sense to run them through illumiprocessor if they’re already paired?). Apologies for my lack of skills! What is the appropriate process?

brantfaircloth commented 5 months ago

They need to be input at or just after the read trimming process and then assembled using phyluce_assemblo_spades. Typically, SRA reads include adapters, so you also typically want to trim them. How you do that is up to you (you don't have to use Illumiprocessor, but you do need to mostly make the directory structures something that phyluce expects).

If you use illumiprocessor, you can just trim the samples manually and then format the directories like phyluce expects. the adapter sequences you use can basically just be the outer parts of the adapter - e.g. if these are dual indexed, tru-seq libraries, then the adapter.fa file for trimmomatic looks like:

>adap1
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>adap2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

You could also trim with cutadapt using something like:

cutadapt -j 24 -m <# CPU cores> -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-o SRR3493972_1.out.fastq -p SRR3493972_2.out.fastq \
../SRR3493972_1.fastq ../SRR3493972_2.fastq

bbessesen commented 5 months ago

Ideally, I would like to put them in from the beginning: starting with count reads, then running illumiprocessor. Is there way to do that if they're already paired? If yes, what does the illum.conf look like?

Here's the basic structure of original raw reads illum.conf:

[adapters] i7:GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCTCGTATGCCGTCTTCTGCTTG i5:AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTGTGTAGATCTCGGTGGTCGCCGTATCATT

[tag sequences] i7-L0063:CGCTCATT i5-L0063:ATACACTT i7-L0064:ACGTCCTG i5-L0064:TTCCATTG

[tag map] RS04B_L0063:i7-L0063,i5-L0063 RS04B_L0064:i7-L0064,i5-L0064

[names] RS04B_L0063:Hydrophis_p_ssp_BB1_USNM192279 RS04B_L0064:Hydrophis_p_ssp_BB2_AMNH106682

brantfaircloth commented 5 months ago

I'm not really sure what you mean by "already paired". do you means the reads are interleaved?

with fastqdump and fasterq dump, you can convert reads from the SRA formal to normal R1 and R2 files that are not interleaved (if that's what you mean).

bbessesen commented 4 months ago

Hi again, I was able to add the outgroup at the start of the pipeline using fake tags in the illum.conf (per a collaborator). The process ran fine and produced a phylip mafft-nexus-edge-trimmed concatenated file but when I try to make a tree with it in Geneious or iQTree, it fails. Is the process he suggested tenable? If yes, what could be the problem with the output?

Going back to my original test runs of phyluce with no-outgroup... those worked as expected, except the trees showed duplicates of all the samples, each labeled 'es_genus_species1, 2, 3...' In my efforts to figure it out, I exploded the mafft-nexus-edge-trimmed alignment. The duplicates showed up in that exploded folder but were NOT in the original exploded fastas folder.

When I exploded the outgroup alignment, it also has duplicate 'es_' files (see screenshot; notice the size of them). Have you seen this before? Screenshot 2024-07-08 072504

brantfaircloth commented 4 months ago

Hard to say, because the amount of information you've provided is not quite enough to get a full picture of what is going on.

My guess is that the es_ samples made their way into the process at some point, and phyluce (and other programs) are interpreting those data as different individuals (because of the es_ prefix). But where they made it into the process is not clear to me. The thing to do would be to go back to the monolithic fasta and see if the es_ sample names are in that file somewhere (you can search through it using less or a text editor or similar). If they are, then they made it into the process during the contig assembly/UCE identification stage... so work backwards from there to see where they might have appeared (and, possibly, how).

bbessesen commented 4 months ago

I just checked and they do not appear in the monolithic file. Which makes sense because they also do not appear in the downstream phyluce_assembly_explode_get_fastas_file folder either.

I only found them after attempting to phase the data and looking inside the folder created by: phyluce_align_explode_alignments --alignments mafft-nexus-edge-trimmed --input-format nexus --output mafft-nexus-edge-trimmed-exploded --by-taxon

Is there any additional info that would be helpful for you to troubleshoot this?

brantfaircloth commented 4 months ago

What happens if you just follow the standard approach with no phasing?

bbessesen commented 4 months ago

Sorry if I wasn't clear... the es_ files showed up in my original concatenation and trees (no phasing)... but it wasn't until some noodling around with phasing (done much later) that they were revealed them to be in the mafft-nexus-edge-trimmed alignment though not upstream.

brantfaircloth commented 4 months ago

This is still not really enough information for me to provide you with much help. The problems you are encountering may be related to the data you are using and/or how they are/were handled.

If the weird alignments are in mafft-nexus-edge-trimmed but not upstream, then it would suggest the trimming introduced the oddballs. You could turn off trimming or trim in a different way - so turn off edge-trimming during alignment (--no-trim) and then decide if you want to use something like Gblocks or TrimAL to trim the resulting alignments. Again, really hard to provide you with many tips without having some sort of actual data to look at.

bbessesen commented 4 months ago

Ok, I just ran: phyluce_align_seqcap_align --no-trim --input live-taxa-incomplete.fasta --output mafft-nexus --taxa 24 --aligner mafft --cores 20 --incomplete-matrix --log-path logtwo

Then to see the output by taxon: phyluce_align_explode_alignments --alignments mafft-nexus --input-format nexus --output mafft-nexus-exploded --by-taxon

The es_ duplicates ARE in there.

Again, they do not appear in the monolithic live-taxa-incomplete.fasta file OR in the taxon-sets/dataset/explodes-fastas folder, which only contains the normally named files (EX: Hydrophis-p-platurus-YB2301.unaligned.fasta).

brantfaircloth commented 4 months ago

Can you explode live-taxa-incomplete.fasta by locus (phyluce_assembly_explode_get_fastas_file), and attach a few loci to your post (pre-alignment) that end up containing these duplicates?

Alternatively, take a few of these files and manually run them through mafft to see what's happening. the actual mafft command that phyluce runs is:

mafft --adjustdirection --maxiterate 1000 <alignment>

bbessesen commented 4 months ago

When I call phyluce_assembly_explode_get_fastas_file --input live-taxa-incomplete.fasta --output exploded-fastas --by-locus, I get an error: unrecognized arguments: --by-locus. Is that coding correct?

brantfaircloth commented 4 months ago

It does not take --by-locus (that's the default behavior).

bbessesen commented 4 months ago

Ah, ok! Below are a few lines of the output. Scrolling through the whole folder, there don't appear to be any duplicated loci (or es_'s).

Screenshot 2024-07-09 101719

brantfaircloth commented 4 months ago

You're probably going to need to look in the files. It also could be that mafft is adding the "es_" during alignment. That's why you're going to need to run mafft.

bbessesen commented 4 months ago

No duplicates appear inside those files either. I will try running mafft separately on a few files. Could you please share the full code for doing that?

Also, es_ issues aside, going back to my original question about adding in SRR outgroups... is it viable to put them in from the start of the phyluce pipeline by using fake tag sequences in the illum.conf?

brantfaircloth commented 4 months ago

Code for mafft is above. So, for one of your alignments, you'll run:

mafft --adjustdirection --maxiterate 1000 AHE-L10.unaligned.fasta > AHE-L10.aligned

Also, es_ issues aside, going back to my original question about adding in SRR outgroups... is it viable to put them in from the start of the phyluce pipeline by using fake tag sequences in the illum.conf?

Likely to be ok. Another option is just manually find the actual tag sequences in the reads and use those. You can also manually trim, as I mentioned above, then get your files and directories renamed as phyluce expects them to be.

bbessesen commented 4 months ago

I ran the alignment and looked inside: every individual is represented just once. I then used that single AHE alignment to create a tree, which worked fine and shows no es_ duplicates.

So what could be happening with the full assemblies? I've run three different subsets of the data starting each pipeline from scratch, and all three have es_'s in the resulting mafft-nexus files.

Given the size of the SqCL probe set, it's not reasonable to run each locus individually. Is there another path that would allow a full assembly with matrix options?

brantfaircloth commented 4 months ago

I'm not sure. If it's not happening when you run a locus manually, I have no idea why it's happening otherwise. And without any data for me to look at to try and repeat the problem, I'm just guessing.

This is an uncommon issue that you are seeing, and the pipeline runs software tests to make sure all is functioning well. Those tests are passing, and no one else is having this problem, so it suggests the issue has something to do with your data/setup and/or how some data are being integrated. Along the similar lines, the SqCL bait set really has nothing to do with me or phyluce, except that a large majority of loci in that bait set are UCEs that we've previously identified. I'm not sure that's the cause of the issue, but if you are using SqCL loci, phyluce is not specifically designed to process those.

The easiest way forward would be for you to try and figure out where/how the "es" is getting in there by finding what repeats the issue. What does the sequence/alignment look like for those individuals having the "es" relative to their "normal" sequence/alignment? At what stage is the "es_" being added, etc. If you can work backwards from there, you can likely figure out what the cause is.

Maybe work with a smaller set of a few loci, at first - ones where you've seen the problem. This will give you a nice little test set to figure things out with, rather than having to work with a large set of loci.

bbessesen commented 4 months ago

Ok, I'll give that a try. Thanks for your help!

faircloth-lab / phyluce

Adding SRR outgroups to the Phyluce pipeline #342