Closed jbernst closed 1 year ago
Have you searched the putatively trimmed fastq files for remaining adapter sequences? I usually do that with portions of the adapter sequence on either side of the asterisk to make sure these have been removed.
To do that, would a grep command on the fastq.gz files work? Or would I need to unzip them? I just did a test on one sample and did not find the chunk of the adapter that I found in that same sequence post assembly step (which is a bit confusing). I tried to grep
a file that I know had adapter contamination in it by using the fastq.gz file that was output from illumiprocessor, and resulted in nothing being found (whether I had it zipped or not).
However, the contigs from SPADES assemblies (*.contigs.fasta), and subsequent steps (e.g., [individual_ID].unaligned.fasta after you explode the monolithic fasta file) have ~1600 hits for adapters, and that is just one part of the i5 adapter (and oddly enough, it is usually the i5 adapter that is the issue).
Also, would this adapter contamination greatly affect the resulting loci and alignments (this screwing up tree inference)? I presume so but wanted to check anyways. Thank you.
You have a couple of options. I usually use gunzip -c <filename> | less
then search within less
for the adapter fragments, but you could also use gunzip -c <filename> | grep <fragment>
. That said, if you are not finding adapter fragments in your cleaned reads, then that suggests something else went wrong when you ran the spades assembly (i.e., spades is not going to add those sequences to assembled contigs for no reason - are you sure you input correctly trimmed reads versus the initial, raw reads?).
I'm not sure of the effects of adapter contamination on downstream inference, although I do expect there could be some.
Just to confirm, I will want to be using grep
on the split-adapter-quality-trimmed directory, containing the [samples].READ1.fastq.gz, [samples].READ2.fastq.gz, and [samples].fastq.gz files, right? I will run those and double-check. If those don't have any hits, I will rerun SPADES on a few samples that I have confirmed adapter contamination and check my scripts (and report it back here so we can either close this or continue figuring this out).
Yes.
Hi Brant,
So I tried this out on three samples that I have confirmed there are adapters within the [samples].contigs.fasta files post-assembly. I went to the split-adapter-quality-trimmed
directory for them and did the command you wrote, and nothing came up. I even tried doing it with less
on one of them, and still got nothing. Yet in the contigs
directory after I run spades, there are adapters in the [samples].contigs.fasta sequences. I don't understand how this is possible.
I'm not sure what's happening - perhaps the search you are running is not correctly finding whatever adapter contamination might be in your trimmed files. Spades is very unlikely to be adding sequence to the assemblies, so the most parsimonious explanation remains that some adapter contamination made it through to your assembled contigs.
I just validated that illumiprocessor is working as expected on both the test data (tests are run every time the repository is updated for illumiprocessor - those tests passed) and "real-life" data from the phyluce tutorial. As additional validation, I ran what I suggested to you against the raw and "clean" data for the alligator fastq file from the tutorial.
raw-data:
clean-data:
-b
I will rerun the trimming and assembly steps again on a subset of samples that I know have contaminations. I hope you don't mind leaving this thread open. I will post my results by next week (I want to get a few samples going in the pipeline). I am sure the issue is on my end.
Thank you for running the test and "real-life" data to confirm, I appreciate it!
you could also skip the middleman and trim the files using trimmomatic, directly. in that case, the adapters.fasta
file that you could use would look like (this is just the inner portion of the adapters):
>i7
ATCTCGTATGCCGTCTTCTGCTTG
>i5
GTGTAGATCTCGGTGGTCGCCGTATCATT
And trimmomatic call would look like:
java -jar trimmomatic-0.39.jar PE -threads <cores> \
raw-data.R1.fq.gz \
raw-data.R2.fq.gz \
clean-data.READ1.trimmed.fastq.gz \
clean-data.READ1.trimmed.unpaired.fastq.gz \
clean-data.READ2.trimmed.fastq.gz \
clean-data.READ2.trimmed.unpaired.fastq.gz \
ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40
I will try a few things. For the adapters.fasta
file, mine reads:
>i5
AATGATACGGCGACCACCGAGATCTACACGTGGCTACACACTCTTTCCCTACACGACGCTCTTCCGATCT
>i7
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTAGCCAATCTCGTATGCCGTCTTCTGCTT
Should there be the asterisk in there? As in:
>i5
AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>i7
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTT
The adapter info the company gave me is: Full sequences: 5’ 8bp i5 indexed adapter 5'AATGATACGGCGACCACCGAGATCTACAC[i5 index]ACACTCTTTCCCTACACGACGCTCTTCCGATCT 3’ 8bp i7 indexed adapter 3'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC[i7 index]ATCTCGTATGCCGTCTTCTGCTT
No, the asterisk should not be in there. Your adapters.fasta
is correct as long as it contains the correct indexes for your sample (and the sequence in each matches the actual sequence in your adapter files - sometimes it may not because the orientation of the adapter sequence (versus what is actually in the data) can be confusing.
The example I sent also works perfectly fine - and it is agnostic to whatever index sequence(s) you used.
Okay, great. And one last question as I start to run these test subsets, did the illumiprocessor file I attached in the initial post look normal, at least in the sense of format? E.g., the adapter sequences in there I have the asterisks, and I have 'i7' and 'i5' for the two barcodes per samples (for R1 and R2)
Seemed fine but I didn't spend a lot of time with it. I'd just take one of the resulting adapters.fasta
files and make sure that the sequence it constructed matches the sequences that actually appear in your R1 and R2 files.
Alright, I'll check. Will post today or tomorrow (or Monday the latest if this takes a bit, but I am only doing a few samples so I do not anticipate that). Thank you!
Hi Brant,
So here is what I did with my test run and the results:
i7:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTT
i5:AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGAT
I will be referring to the left and right side of the asterisk of each adapter is i7 parts 1 and 2 (i7p1 & i7p2) and i5 parts 1 and 2 (i5p1 & i5p2)
When I checked on all [sample].contigs.fasta files post-assembly (using grep
to search for i7p1, i7p2, i5p1, and i5p2:
i7p1 = 0 hits
i7p2 = 0 hits
i5p1 = 951 hits
i5p2 = 53,750 hits
For the rest of this post, I will focus on what I did rerunning illumiprocessor and SPADES on a subsample of the data (5 samples):
007_R1.fastq.gz
007_R2.fastq.gz
008_R1.fastq.gz
008_R2.fastq.gz
010_R1.fastq.gz
010_R2.fastq.gz
044_R1.fastq.gz
044_R2.fastq.gz
138_R1.fastq.gz
138_R2.fastq.gz
I unzipped each file and used grep AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC CS_NCSM99002-SL-KH.contigs.fasta | wc -l
and then repeated this for i7p2, i5p1, and i5p2.
Results: i7p1 / i7p2 / i5p1 / i5p2
007_R1.fastq = 39593 / 32594 / 0 / 0
007_R2.fastq = 0 / 0 / 0 / 0
008_R1.fastq = 78163 / 5270 / 0 / 2
008_R2.fastq = 0 / 0 / 0 / 0
010_R1.fastq = 21713 / 1901 / 0 / 0
010_R2.fastq = 21713 / 1901 / 0 / 0
044_R1.fastq = 15102 / 7115 / 0 / 0
044_R2.fastq = 0 / 0 / 0 / 0
138_R1.fastq = 17019 / 19 / 0 / 0
138_R2.fastq = 0 / 0 / 0 / 0
I am not sure why I was no hits for some of these.
illumiprocessor
All instances of grep
for i7p1, i7p2, i5p1, i5p2 = 0
I then ran SPADES under default parameters.
SPADES
Below is the config file:
[adapters]
i7:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTT
i5:AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGATCT
# this is the list of indexes we used
[tag sequences]
i5-7330-JK-5_007-1:TATTGCTT
i7-7330-JK-5_007-2:AAGCAATA
i5-7330-JK-5_008-1:AATTGAGA
i7-7330-JK-5_008-2:TTACAACG
i5-7330-JK-5_010-1:AAGGCTGT
i7-7330-JK-5_010-2:TTGCAGTA
i5-7330-JK-5_044-1:TATTATCC
i7-7330-JK-5_044-2:GGATAATA
i5-7330-JK-5_138-1:GTAGCCAC
i7-7330-JK-5_138-2:CCTAGCCA
# this is how each index maps to each set of reads
[tag map]
7330-JK-5_007:i5-7330-JK-5_007-1,i7-7330-JK-5_007-2
7330-JK-5_008:i5-7330-JK-5_008-1,i7-7330-JK-5_008-2
7330-JK-5_010:i5-7330-JK-5_010-1,i7-7330-JK-5_010-2
7330-JK-5_044:i5-7330-JK-5_044-1,i7-7330-JK-5_044-2
7330-JK-5_138:i5-7330-JK-5_138-1,i7-7330-JK-5_138-2
# we want to rename our read files something a bit more nice - so we will
# rename Alligator_mississippiensis_GGAGCTATGG to alligator_mississippiensis
[names]
7330-JK-5_007:CA_XXXX91002-SL-HK
7330-JK-5_008:CA_XXX41586-SL-MM-Y
7330-JK-5_010:CB_XXXX179317-SL-PH-nozul
7330-JK-5_044:CA_XX20261-SL-CAIP6
7330-JK-5_138:HA_XXXXX20348
SPADES ran without issues (but ran out of time for sample 010, so results will be 'NA'). I used the same grep
command above to search for adapter parts.
Results: i7p1 / i7p2 / i5p1 / i5p2
7330-JK-5_007: 0 / 0 / 12 / 451
7330-JK-5_008: 0 / 0 / 27 / 1618
7330-JK-5_010: NA / NA / NA / NA
7330-JK-5_044: 0 / 0 / 6 / 125
7330-JK-5_138: 0 / 0 / 1 / 176
So now grep
is showing hits for i5 adapters post-SPADES (even though illumiprocessor seemed to have gotten rid of all of the adapter sequences).
I just found out that the barcodes for my demultiplexing got messed up by the company (they have just sent me a new file); do you think that by having the wrong barcodes associated with samples, that would affect adapter trimming? Or would illumiprocessor still get rid of the adapters (and just not the barcodes)?
Thank you!
I think it's possible that the barcode/index issue might be causing you some problems, but it's really hard to say. The other thing you always need to do is to ensure that the sequence string for each adapter you are constructing is actually correct. This is complicated because of the confusing way that Illumina deals with barcodes and how different sequencing providers request and return your index sequences. I always search for at least several of the i5 and i7 sequences that Illumiprocessor will construct to make sure they appear in my libraries.
As for how the adapters might be showing up in spades - I have a guess. Basically, grep is going to fail to find adapters in reads where there is a sequencing error in the adapter contamination that is being sequenced (because grep is searching for 100% match). Spades, on the other hand, corrects sequencing errors as part of it's processing steps - so it could be that trimmomatic is missing some of these error-containing adapter fragments, as well; then those are making it through; getting correct by spades to the sequence you expect; then showing up in your assemblies. I'm not sure this is what is happening, but it's a guess.
To see if this might be the case, it would not hurt to try trimmomatic, manually, with the adapters.fa file (and your raw reads). I would probably run this, and adjust the following:
ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
specifically, the <seed mismatches>
component. That is set to 2 in Illumiprocessor, as in:
java -jar trimmomatic-0.39.jar PE -threads <cores> \
raw-data.R1.fq.gz \
raw-data.R2.fq.gz \
clean-data.READ1.trimmed.fastq.gz \
clean-data.READ1.trimmed.unpaired.fastq.gz \
clean-data.READ2.trimmed.fastq.gz \
clean-data.READ2.trimmed.unpaired.fastq.gz \
ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40
but you might try a larger number, like 4:
java -jar trimmomatic-0.39.jar PE -threads <cores> \
raw-data.R1.fq.gz \
raw-data.R2.fq.gz \
clean-data.READ1.trimmed.fastq.gz \
clean-data.READ1.trimmed.unpaired.fastq.gz \
clean-data.READ2.trimmed.fastq.gz \
clean-data.READ2.trimmed.unpaired.fastq.gz \
ILLUMINACLIP:adapters.fa:4:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40
Then re-assemble that library you are testing with.
It might also be worthwhile to align your trimmed reads for a sample where you see adapter sequence in the contigs back to your assembled contigs, then determine which individual reads are causing you problems... and what those reads might represent. At least that helps you figure out what reads are (potentially) causing the problem.
Thanks again for the suggestion. I will try using trimmomatic manually with the adapters file. So I will run the above command (and I will try it with 2 and 4) for a few of these files that I mentioned above.
Are the following just names for the output files? (i.e., I am just giving them a name?):
clean-data.READ1.trimmed.fastq.gz \
clean-data.READ1.trimmed.unpaired.fastq.gz \
clean-data.READ2.trimmed.fastq.gz \
clean-data.READ2.trimmed.unpaired.fastq.gz \
Then after this rerun SPADES and see if the outputs have any adapters?
PS - It turns out the company sent the wrong demultiplex file...so that fixed the weird topology and branch lengths of the phylogeny. But it didn't change the fact that some adapters (and I bet barcodes) are making it through, which won't be good for downstream analysis.
Yes, those are just the output file names.
Thank you! I will report what I find here.
One quick thing that I just noticed, which definitely seems like it would be an issue on my part. Before I try trimmomatic, I wanted to check something with you. I was grabbing the adapter.fasta
files that are generated and noticed something:
# will be replaced with the appropriate index for the sample.
[adapters]
i7:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTT
i5:AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGATCT
# this is the list of indexes we used
[tag sequences]
i5-7330-JK-5_007-1:TATTGCTT = NA (sequence in adapter file is the i7 instead: AAGCAATA)
i7-7330-JK-5_007-2:AAGCAATA = HIT
i5-7330-JK-5_008-1:AATTGAGA = NA (adapter file sequence, TCTCAATT, is not even a barcode amongst any of my samples)
i7-7330-JK-5_008-2:TTACAACG = HIT
i5-7330-JK-5_010-1:AAGGCTGT = NA (adapter file shows ACAGCCTT instead, which is an i7 index of a sample not within this subset of 5 samples)
i7-7330-JK-5_010-2:TTGCAGTA = HIT
i5-7330-JK-5_044-1:TATTATCC = NA (sequence in adapter file is the i7 instead: GGATAATA)
i7-7330-JK-5_044-2:GGATAATA = HIT
i5-7330-JK-5_138-1:GTAGCCAC = NA (adapter file sequence, GTGGCTAC, is not even a barcode amongst any of my samples)
i7-7330-JK-5_138-2:CCTAGCCA = HIT
# this is how each index maps to each set of reads
[tag map]
7330-JK-5_007:i5-7330-JK-5_007-1,i7-7330-JK-5_007-2
7330-JK-5_008:i5-7330-JK-5_008-1,i7-7330-JK-5_008-2
7330-JK-5_010:i5-7330-JK-5_010-1,i7-7330-JK-5_010-2
7330-JK-5_044:i5-7330-JK-5_044-1,i7-7330-JK-5_044-2
7330-JK-5_138:i5-7330-JK-5_138-1,i7-7330-JK-5_138-2
# we want to rename our read files something a bit more nice - so we will
# rename Alligator_mississippiensis_GGAGCTATGG to alligator_mississippiensis
[names]
7330-JK-5_007:CA_XXXX91002-SL-HK
7330-JK-5_008:CA_XXX41586-SL-MM-Y
7330-JK-5_010:CB_XXXX179317-SL-PH-nozul
7330-JK-5_044:CA_XX20261-SL-CAIP6
7330-JK-5_138:HA_XXXXX20348
This would explain why during my 'test' I did before on the five samples, the i7 adapters were cleaned out, but not the i5 adapters. It looks like the i5 adapters in the adapter.fasta
files have changed somehow. I double-checked all five of the samples in my subset (above), and the barcode that is in place of the asterisk changes. For example, for samples ID i5-7330-JK-5_007, the adapter.fasta
file is as follows (the asterisks in the file are not actually there, that is just to help you visualize what set of 8 nucleotides were in that position):
>i5
AATGATACGGCGACCACCGAGATCTACAC**AAGCAATA**ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>i7
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC**AAGCAATA**ATCTCGTATGCCGTCTTCTGCTT
The other examples I added a bit of an explanation of what I saw (so the i5 adapter either 'changed' to the i7 adapter, or was simply a completely different the of 8 nucleotides then from what I put in the .config file. I would assume this means that illumiprocessor is doing great at the i7 adapter removal but something is inhibiting the i5 adapter removal due to these erroneous sequences.
Does my format look correct? It is a text file that is a straight copy and paste of what I have put above in this post. Am I using appropriate syntax here (i.e., putting i7 and i5 at the beginning of lines, commas, etc.).
Looks like the adapters that are in your data are the reverse complement of what you think they are. e.g.:
i5-7330-JK-5_007-1:TATTGCTT = NA (sequence in adapter file is the i7 instead: AAGCAATA)
This is what I meant when I said you need to make sure that what you are telling Illumiprocessor to trim is what is actually in the sequence (not simply what the adapter sequence is). This is also confusing because Illumina uses the "adapter sequence" in different orientations.
e.g. also see:
i5-7330-JK-5_008-1:AATTGAGA = NA (adapter file sequence, TCTCAATT, is not even a barcode amongst any of my samples)
TCTCAATT
is the revcomp of AATTGAGA
I see. It looks like that is the case for all five of these (I just double-checked)
So would this be a matter of talking to the sequencing company? Is it standard for one to need to determine the reverse complement of what the adapter sequence is vs. what is actually in the strand? They sent me a demultiplex file that has every same, the sequencing info and file name, and then the i7 and i5 'barcode' for each sequence.
It’s up to you to figure out how the index sequence is likely to be in your reads (as adapter contamination) so you can trim it. The Adapterama file (if you used our indexes) contains forward and reverse complements of all adapters. You can also skip searching for the index altogether (as I mentioned above), but you’ll have to trim all your files manually (not hard with some bash scripting).
I think I am close to figuring this out. I spoke to the sequencing company and there are reverse complement orientations to my index barcodes. I just checked my uce alignments post-edge and post-internal trimming, and only one single hit for barcode was found. Given this, I think it is safe for me to just use sed or manually remove them, right? I was thinking of doing searched and removing the following:
>i5
AATGATACGGCGACCACCGAGATCTACAC*[reverse complement index]*ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>i7
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*[reverse complement index]*ATCTCGTATGCCGTCTTCTGCTT
I will also remove the left and right sequences around the adapter. And also doing the reverse complement of the barcodes themselves (I checked, and there are indeed entire adapters in there of the reverse complement).
So when you mention bash scripting, does sed
sound like an ideal way for me to go about this? On individual alignments? I just want to make sure you don't think I should do this on a part prior to the alignments and cleaning of alignments.
Also, here is what the sequencing company told me. I am posting this information here in case anyone ever has a similar scenario I am having. I got my sequences through Arbor Biosciences, so I cannot speak for other companies:
Different Illumina instruments use different workflows. The Forward Strand Workflow requires the i5 in the forward orientation and the Reverse Complement Workflow requires the reverse complemented i5. The NovaSeq v1 chemistry (pre-November 2020) used the Forward Strand Workflow. The v1.5 chemistry was released in November of 2020, and so the NovaSeq was switched to the Reverse Complement Workflow. Some sequencing facilities still require that the indexes are provided in the Forward Strand orientation because they do a screen on the MiSeq first, which uses the Forward Strand Workflow. So, the indexes that are provided are in the Forward Strand Workflow orientation. Link to workflows here.
The only thing that will make the approach you propose tricky is that the adapters + barcodes are not always sequenced perfectly... so you can miss matches that contain errors (sed
, grep
, and others search for "perfect" matches.
There are several ways around this, but one way to work around would be to trim your contigs for adapter sequences prior to aligning them. You could possibly do this with something like trimmomatic, although a tool meant for longer sequences (like poretools or similar) might be better. All of that said, appropriately removing adapters from the starting reads is still likely the most optimal way forward.
The information Arbor provided for Illumina sequences is pretty standard stuff - different instruments (even from Illumina) work in slightly different ways, so you need to know what to look for in order to trim it correctly. You also need to have some knowledge of how the adapter sequences work and what actually gets sequenced, in what order (we provide lots of figures in our Adapterama paper, as well as some history of different adapter systems).
What I do is to always search the reads I receive (R1 and R2) for
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
and
ATCTCGTATGCCGTCTTCTGCTT
Which are the 3' ends of each adapter sequence. Then, I look to see which index sequence is adjacent to those sequences and how that index sequence compares to the index sequence I expect. Once I know if I need to use the forward or reverse complement of the index I expect, I construct the config file for Illumiprocessor. After that, I've not had problems with adapter contamination.
That's really helpful. I think the way for me to go about this and really ensure I do this correctly is to start this over and do trimmomatic using the reverse complements (which are clearly in the sequences). One last question:
In the config file, you insert something like:
[adapters]
i7:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTT
i5:AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGATCT
Can I insert more lines there to take into account reverse complements? Or would phyluce throw an error at me (so there would be four time, in which the reverse complement is also given for i7 and i5).
And I suppose the would go for the lines where you give the barcode index sequences for each sequences (4 lines, forward i7, forward i5, reverse complement i7, reverse complement i5).
You cannot include revcomps (for the entire adapter) when using illumiprocessor (I.e., you can only include 2 sequences). I'm also pretty sure that you do not need to bother with 4, given how whole genome libraries are prepared and sequenced on Illumina instruments (amplicons sequenced on Illumina are a slightly different story).
That said, you can include whatever adapter sequence(s) you like when using trimmomatic outside of Illumiprocessor, although more sequences searched for may slow things down a bit.
Okay, I will just do a search and rerun everything using the correct orientation. I think at this point we can close this (but if you prefer me to report back and leave this open, I can do that too). Thanks again for all the suggestions!
I'll go ahead and close - but you can still add comments to a closed issue.
I am just adding a comment in here to state that the reverse complementing of the i5 adapter and the associated i5 barcodes worked. I have checked and there are no hits for searching for any of the adapters or respective reverse complements.
Thank you!
Hi all,
Has anybody had problems with illumiprocessor not removing adapters and barcodes? I ran phyluce v1.7.1 straight through to the tree building step with RAxML-NG, and ran into no issues. This is a dataset of UCEs (5k probe set) on 141 vertebrate samples from the same family:
I assembled my sequences using SPADES under default parameters in Phyluce. I've double checked my illumiprocessor.conf to ensure that the sample IDs and numbers, adapter and barcode sequences, and just the general format of the config file are all correct. I can't spot any issues, but I will attach it here (as a .txt, not a .conf for Github); perhaps I am overlooking something?
I was investigating my data because I have parts of the population data (the 125) samples that have incredibly long branches but little missing data. In an alignment viewer I noticed that there are regions of some loci that have strings of 50-100 or so nucleotides that are identical between individuals of geographically distant populations, and even outgroup and ingroup taxa. This is causing the outgroups to be placed within the tree or lump together populations that seem more artefactual rather than preliminary evidence of mixing populations. I even tried filtering out loci with >10% parsimony informative sites, reducing my dataset from ~4800 loci to ~1200 loci and still got the same exact results.
I used
grep
with parts of my adapter sequences and found that most of my samples, whether they are showing long branches or not, still have adapters in them. Any advice would be a huge help, and I am happy to share more files through email. If I forgot any pertinent information here that would be helpful, please let me know.Thank you!
Best, Justin
illumiprocessor_fresh_GH.txt