faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/
Other
76 stars 48 forks source link

Illumiprocessor not removing adapters and barcodes? #287

Closed jbernst closed 1 year ago

jbernst commented 1 year ago

Hi all,

Has anybody had problems with illumiprocessor not removing adapters and barcodes? I ran phyluce v1.7.1 straight through to the tree building step with RAxML-NG, and ran into no issues. This is a dataset of UCEs (5k probe set) on 141 vertebrate samples from the same family:

I assembled my sequences using SPADES under default parameters in Phyluce. I've double checked my illumiprocessor.conf to ensure that the sample IDs and numbers, adapter and barcode sequences, and just the general format of the config file are all correct. I can't spot any issues, but I will attach it here (as a .txt, not a .conf for Github); perhaps I am overlooking something?

I was investigating my data because I have parts of the population data (the 125) samples that have incredibly long branches but little missing data. In an alignment viewer I noticed that there are regions of some loci that have strings of 50-100 or so nucleotides that are identical between individuals of geographically distant populations, and even outgroup and ingroup taxa. This is causing the outgroups to be placed within the tree or lump together populations that seem more artefactual rather than preliminary evidence of mixing populations. I even tried filtering out loci with >10% parsimony informative sites, reducing my dataset from ~4800 loci to ~1200 loci and still got the same exact results.

I used grep with parts of my adapter sequences and found that most of my samples, whether they are showing long branches or not, still have adapters in them. Any advice would be a huge help, and I am happy to share more files through email. If I forgot any pertinent information here that would be helpful, please let me know.

Thank you!

Best, Justin

illumiprocessor_fresh_GH.txt

brantfaircloth commented 1 year ago

Have you searched the putatively trimmed fastq files for remaining adapter sequences? I usually do that with portions of the adapter sequence on either side of the asterisk to make sure these have been removed.

jbernst commented 1 year ago

To do that, would a grep command on the fastq.gz files work? Or would I need to unzip them? I just did a test on one sample and did not find the chunk of the adapter that I found in that same sequence post assembly step (which is a bit confusing). I tried to grep a file that I know had adapter contamination in it by using the fastq.gz file that was output from illumiprocessor, and resulted in nothing being found (whether I had it zipped or not).

However, the contigs from SPADES assemblies (*.contigs.fasta), and subsequent steps (e.g., [individual_ID].unaligned.fasta after you explode the monolithic fasta file) have ~1600 hits for adapters, and that is just one part of the i5 adapter (and oddly enough, it is usually the i5 adapter that is the issue).

Also, would this adapter contamination greatly affect the resulting loci and alignments (this screwing up tree inference)? I presume so but wanted to check anyways. Thank you.

brantfaircloth commented 1 year ago

You have a couple of options. I usually use gunzip -c <filename> | less then search within less for the adapter fragments, but you could also use gunzip -c <filename> | grep <fragment>. That said, if you are not finding adapter fragments in your cleaned reads, then that suggests something else went wrong when you ran the spades assembly (i.e., spades is not going to add those sequences to assembled contigs for no reason - are you sure you input correctly trimmed reads versus the initial, raw reads?).

I'm not sure of the effects of adapter contamination on downstream inference, although I do expect there could be some.

jbernst commented 1 year ago

Just to confirm, I will want to be using grep on the split-adapter-quality-trimmed directory, containing the [samples].READ1.fastq.gz, [samples].READ2.fastq.gz, and [samples].fastq.gz files, right? I will run those and double-check. If those don't have any hits, I will rerun SPADES on a few samples that I have confirmed adapter contamination and check my scripts (and report it back here so we can either close this or continue figuring this out).

brantfaircloth commented 1 year ago

Yes.

jbernst commented 1 year ago

Hi Brant,

So I tried this out on three samples that I have confirmed there are adapters within the [samples].contigs.fasta files post-assembly. I went to the split-adapter-quality-trimmed directory for them and did the command you wrote, and nothing came up. I even tried doing it with less on one of them, and still got nothing. Yet in the contigs directory after I run spades, there are adapters in the [samples].contigs.fasta sequences. I don't understand how this is possible.

brantfaircloth commented 1 year ago

I'm not sure what's happening - perhaps the search you are running is not correctly finding whatever adapter contamination might be in your trimmed files. Spades is very unlikely to be adding sequence to the assemblies, so the most parsimonious explanation remains that some adapter contamination made it through to your assembled contigs.

I just validated that illumiprocessor is working as expected on both the test data (tests are run every time the repository is updated for illumiprocessor - those tests passed) and "real-life" data from the phyluce tutorial. As additional validation, I ran what I suggested to you against the raw and "clean" data for the alligator fastq file from the tutorial.

raw-data:

Screenshot 2023-01-18 at 15 04 21

clean-data:

Screenshot 2023-01-18 at 15 05 20

-b

jbernst commented 1 year ago

I will rerun the trimming and assembly steps again on a subset of samples that I know have contaminations. I hope you don't mind leaving this thread open. I will post my results by next week (I want to get a few samples going in the pipeline). I am sure the issue is on my end.

Thank you for running the test and "real-life" data to confirm, I appreciate it!

brantfaircloth commented 1 year ago

you could also skip the middleman and trim the files using trimmomatic, directly. in that case, the adapters.fasta file that you could use would look like (this is just the inner portion of the adapters):

>i7
ATCTCGTATGCCGTCTTCTGCTTG
>i5
GTGTAGATCTCGGTGGTCGCCGTATCATT

And trimmomatic call would look like:

java -jar trimmomatic-0.39.jar PE -threads <cores> \
         raw-data.R1.fq.gz \
         raw-data.R2.fq.gz \
         clean-data.READ1.trimmed.fastq.gz \
         clean-data.READ1.trimmed.unpaired.fastq.gz \
         clean-data.READ2.trimmed.fastq.gz \
         clean-data.READ2.trimmed.unpaired.fastq.gz \
         ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40
jbernst commented 1 year ago

I will try a few things. For the adapters.fasta file, mine reads:

>i5
AATGATACGGCGACCACCGAGATCTACACGTGGCTACACACTCTTTCCCTACACGACGCTCTTCCGATCT
>i7
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTAGCCAATCTCGTATGCCGTCTTCTGCTT

Should there be the asterisk in there? As in:

>i5
AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>i7
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTT

The adapter info the company gave me is: Full sequences: 5’ 8bp i5 indexed adapter 5'AATGATACGGCGACCACCGAGATCTACAC[i5 index]ACACTCTTTCCCTACACGACGCTCTTCCGATCT 3’ 8bp i7 indexed adapter 3'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC[i7 index]ATCTCGTATGCCGTCTTCTGCTT

brantfaircloth commented 1 year ago

No, the asterisk should not be in there. Your adapters.fasta is correct as long as it contains the correct indexes for your sample (and the sequence in each matches the actual sequence in your adapter files - sometimes it may not because the orientation of the adapter sequence (versus what is actually in the data) can be confusing.

The example I sent also works perfectly fine - and it is agnostic to whatever index sequence(s) you used.

jbernst commented 1 year ago

Okay, great. And one last question as I start to run these test subsets, did the illumiprocessor file I attached in the initial post look normal, at least in the sense of format? E.g., the adapter sequences in there I have the asterisks, and I have 'i7' and 'i5' for the two barcodes per samples (for R1 and R2)

brantfaircloth commented 1 year ago

Seemed fine but I didn't spend a lot of time with it. I'd just take one of the resulting adapters.fasta files and make sure that the sequence it constructed matches the sequences that actually appear in your R1 and R2 files.

jbernst commented 1 year ago

Alright, I'll check. Will post today or tomorrow (or Monday the latest if this takes a bit, but I am only doing a few samples so I do not anticipate that). Thank you!

jbernst commented 1 year ago

Hi Brant,

So here is what I did with my test run and the results:

Adapters

i7:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTT
i5:AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGAT

I will be referring to the left and right side of the asterisk of each adapter is i7 parts 1 and 2 (i7p1 & i7p2) and i5 parts 1 and 2 (i5p1 & i5p2)

When I checked on all [sample].contigs.fasta files post-assembly (using grep to search for i7p1, i7p2, i5p1, and i5p2: i7p1 = 0 hits i7p2 = 0 hits i5p1 = 951 hits i5p2 = 53,750 hits

For the rest of this post, I will focus on what I did rerunning illumiprocessor and SPADES on a subsample of the data (5 samples):

Test data

007_R1.fastq.gz
007_R2.fastq.gz
008_R1.fastq.gz
008_R2.fastq.gz
010_R1.fastq.gz
010_R2.fastq.gz
044_R1.fastq.gz
044_R2.fastq.gz
138_R1.fastq.gz
138_R2.fastq.gz

Adapter check for raw data

I unzipped each file and used grep AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC CS_NCSM99002-SL-KH.contigs.fasta | wc -l and then repeated this for i7p2, i5p1, and i5p2.

Results: i7p1 / i7p2 / i5p1 / i5p2

007_R1.fastq = 39593 / 32594 / 0 / 0
007_R2.fastq = 0 / 0 / 0 / 0
008_R1.fastq = 78163 / 5270 / 0 / 2
008_R2.fastq = 0 / 0 / 0 / 0
010_R1.fastq = 21713 / 1901 / 0 / 0
010_R2.fastq = 21713 / 1901 / 0 / 0
044_R1.fastq = 15102 / 7115 / 0 / 0
044_R2.fastq = 0 / 0 / 0 / 0
138_R1.fastq = 17019 / 19 / 0 / 0
138_R2.fastq = 0 / 0 / 0 / 0

I am not sure why I was no hits for some of these.

Adapter check post-illumiprocessor

All instances of grep for i7p1, i7p2, i5p1, i5p2 = 0

I then ran SPADES under default parameters.

Adapter check after SPADES

Below is the config file:

[adapters]
i7:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTT
i5:AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGATCT

# this is the list of indexes we used
[tag sequences]
i5-7330-JK-5_007-1:TATTGCTT
i7-7330-JK-5_007-2:AAGCAATA
i5-7330-JK-5_008-1:AATTGAGA
i7-7330-JK-5_008-2:TTACAACG
i5-7330-JK-5_010-1:AAGGCTGT
i7-7330-JK-5_010-2:TTGCAGTA
i5-7330-JK-5_044-1:TATTATCC
i7-7330-JK-5_044-2:GGATAATA
i5-7330-JK-5_138-1:GTAGCCAC
i7-7330-JK-5_138-2:CCTAGCCA

# this is how each index maps to each set of reads
[tag map]
7330-JK-5_007:i5-7330-JK-5_007-1,i7-7330-JK-5_007-2
7330-JK-5_008:i5-7330-JK-5_008-1,i7-7330-JK-5_008-2
7330-JK-5_010:i5-7330-JK-5_010-1,i7-7330-JK-5_010-2
7330-JK-5_044:i5-7330-JK-5_044-1,i7-7330-JK-5_044-2
7330-JK-5_138:i5-7330-JK-5_138-1,i7-7330-JK-5_138-2

# we want to rename our read files something a bit more nice - so we will
# rename Alligator_mississippiensis_GGAGCTATGG to alligator_mississippiensis
[names]
7330-JK-5_007:CA_XXXX91002-SL-HK
7330-JK-5_008:CA_XXX41586-SL-MM-Y
7330-JK-5_010:CB_XXXX179317-SL-PH-nozul
7330-JK-5_044:CA_XX20261-SL-CAIP6
7330-JK-5_138:HA_XXXXX20348

SPADES ran without issues (but ran out of time for sample 010, so results will be 'NA'). I used the same grep command above to search for adapter parts.

Results: i7p1 / i7p2 / i5p1 / i5p2

7330-JK-5_007: 0 / 0 / 12 / 451
7330-JK-5_008: 0 / 0 / 27 / 1618
7330-JK-5_010: NA / NA / NA / NA
7330-JK-5_044: 0 / 0 / 6 / 125
7330-JK-5_138: 0 / 0 /  1 / 176

So now grep is showing hits for i5 adapters post-SPADES (even though illumiprocessor seemed to have gotten rid of all of the adapter sequences).

I just found out that the barcodes for my demultiplexing got messed up by the company (they have just sent me a new file); do you think that by having the wrong barcodes associated with samples, that would affect adapter trimming? Or would illumiprocessor still get rid of the adapters (and just not the barcodes)?

Thank you!

brantfaircloth commented 1 year ago

I think it's possible that the barcode/index issue might be causing you some problems, but it's really hard to say. The other thing you always need to do is to ensure that the sequence string for each adapter you are constructing is actually correct. This is complicated because of the confusing way that Illumina deals with barcodes and how different sequencing providers request and return your index sequences. I always search for at least several of the i5 and i7 sequences that Illumiprocessor will construct to make sure they appear in my libraries.

As for how the adapters might be showing up in spades - I have a guess. Basically, grep is going to fail to find adapters in reads where there is a sequencing error in the adapter contamination that is being sequenced (because grep is searching for 100% match). Spades, on the other hand, corrects sequencing errors as part of it's processing steps - so it could be that trimmomatic is missing some of these error-containing adapter fragments, as well; then those are making it through; getting correct by spades to the sequence you expect; then showing up in your assemblies. I'm not sure this is what is happening, but it's a guess.

To see if this might be the case, it would not hurt to try trimmomatic, manually, with the adapters.fa file (and your raw reads). I would probably run this, and adjust the following:

ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>

specifically, the <seed mismatches> component. That is set to 2 in Illumiprocessor, as in:

java -jar trimmomatic-0.39.jar PE -threads <cores> \
         raw-data.R1.fq.gz \
         raw-data.R2.fq.gz \
         clean-data.READ1.trimmed.fastq.gz \
         clean-data.READ1.trimmed.unpaired.fastq.gz \
         clean-data.READ2.trimmed.fastq.gz \
         clean-data.READ2.trimmed.unpaired.fastq.gz \
         ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40

but you might try a larger number, like 4:

java -jar trimmomatic-0.39.jar PE -threads <cores> \
         raw-data.R1.fq.gz \
         raw-data.R2.fq.gz \
         clean-data.READ1.trimmed.fastq.gz \
         clean-data.READ1.trimmed.unpaired.fastq.gz \
         clean-data.READ2.trimmed.fastq.gz \
         clean-data.READ2.trimmed.unpaired.fastq.gz \
         ILLUMINACLIP:adapters.fa:4:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40

Then re-assemble that library you are testing with.

It might also be worthwhile to align your trimmed reads for a sample where you see adapter sequence in the contigs back to your assembled contigs, then determine which individual reads are causing you problems... and what those reads might represent. At least that helps you figure out what reads are (potentially) causing the problem.

jbernst commented 1 year ago

Thanks again for the suggestion. I will try using trimmomatic manually with the adapters file. So I will run the above command (and I will try it with 2 and 4) for a few of these files that I mentioned above.

Are the following just names for the output files? (i.e., I am just giving them a name?):

clean-data.READ1.trimmed.fastq.gz \
clean-data.READ1.trimmed.unpaired.fastq.gz \
clean-data.READ2.trimmed.fastq.gz \
clean-data.READ2.trimmed.unpaired.fastq.gz \

Then after this rerun SPADES and see if the outputs have any adapters?

PS - It turns out the company sent the wrong demultiplex file...so that fixed the weird topology and branch lengths of the phylogeny. But it didn't change the fact that some adapters (and I bet barcodes) are making it through, which won't be good for downstream analysis.

brantfaircloth commented 1 year ago

Yes, those are just the output file names.

jbernst commented 1 year ago

Thank you! I will report what I find here.

jbernst commented 1 year ago

One quick thing that I just noticed, which definitely seems like it would be an issue on my part. Before I try trimmomatic, I wanted to check something with you. I was grabbing the adapter.fasta files that are generated and noticed something:

# will be replaced with the appropriate index for the sample.
[adapters]
i7:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTT
i5:AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGATCT

# this is the list of indexes we used
[tag sequences]
i5-7330-JK-5_007-1:TATTGCTT = NA (sequence in adapter file is the i7 instead: AAGCAATA)
i7-7330-JK-5_007-2:AAGCAATA = HIT
i5-7330-JK-5_008-1:AATTGAGA = NA (adapter file sequence, TCTCAATT, is not even a barcode amongst any of my samples)
i7-7330-JK-5_008-2:TTACAACG = HIT
i5-7330-JK-5_010-1:AAGGCTGT = NA (adapter file shows ACAGCCTT instead, which is an i7 index of a sample not within this subset of 5 samples)

i7-7330-JK-5_010-2:TTGCAGTA = HIT
i5-7330-JK-5_044-1:TATTATCC = NA (sequence in adapter file is the i7 instead: GGATAATA)
i7-7330-JK-5_044-2:GGATAATA = HIT
i5-7330-JK-5_138-1:GTAGCCAC = NA (adapter file sequence, GTGGCTAC, is not even a barcode amongst any of my samples)
i7-7330-JK-5_138-2:CCTAGCCA = HIT

# this is how each index maps to each set of reads
[tag map]
7330-JK-5_007:i5-7330-JK-5_007-1,i7-7330-JK-5_007-2
7330-JK-5_008:i5-7330-JK-5_008-1,i7-7330-JK-5_008-2
7330-JK-5_010:i5-7330-JK-5_010-1,i7-7330-JK-5_010-2
7330-JK-5_044:i5-7330-JK-5_044-1,i7-7330-JK-5_044-2
7330-JK-5_138:i5-7330-JK-5_138-1,i7-7330-JK-5_138-2

# we want to rename our read files something a bit more nice - so we will
# rename Alligator_mississippiensis_GGAGCTATGG to alligator_mississippiensis
[names]
7330-JK-5_007:CA_XXXX91002-SL-HK
7330-JK-5_008:CA_XXX41586-SL-MM-Y
7330-JK-5_010:CB_XXXX179317-SL-PH-nozul
7330-JK-5_044:CA_XX20261-SL-CAIP6
7330-JK-5_138:HA_XXXXX20348

This would explain why during my 'test' I did before on the five samples, the i7 adapters were cleaned out, but not the i5 adapters. It looks like the i5 adapters in the adapter.fasta files have changed somehow. I double-checked all five of the samples in my subset (above), and the barcode that is in place of the asterisk changes. For example, for samples ID i5-7330-JK-5_007, the adapter.fasta file is as follows (the asterisks in the file are not actually there, that is just to help you visualize what set of 8 nucleotides were in that position):

>i5
AATGATACGGCGACCACCGAGATCTACAC**AAGCAATA**ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>i7
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC**AAGCAATA**ATCTCGTATGCCGTCTTCTGCTT

The other examples I added a bit of an explanation of what I saw (so the i5 adapter either 'changed' to the i7 adapter, or was simply a completely different the of 8 nucleotides then from what I put in the .config file. I would assume this means that illumiprocessor is doing great at the i7 adapter removal but something is inhibiting the i5 adapter removal due to these erroneous sequences.

Does my format look correct? It is a text file that is a straight copy and paste of what I have put above in this post. Am I using appropriate syntax here (i.e., putting i7 and i5 at the beginning of lines, commas, etc.).

brantfaircloth commented 1 year ago

Looks like the adapters that are in your data are the reverse complement of what you think they are. e.g.:

i5-7330-JK-5_007-1:TATTGCTT = NA (sequence in adapter file is the i7 instead: AAGCAATA)

This is what I meant when I said you need to make sure that what you are telling Illumiprocessor to trim is what is actually in the sequence (not simply what the adapter sequence is). This is also confusing because Illumina uses the "adapter sequence" in different orientations.

e.g. also see:

i5-7330-JK-5_008-1:AATTGAGA = NA (adapter file sequence, TCTCAATT, is not even a barcode amongst any of my samples)

TCTCAATT is the revcomp of AATTGAGA

jbernst commented 1 year ago

I see. It looks like that is the case for all five of these (I just double-checked)

So would this be a matter of talking to the sequencing company? Is it standard for one to need to determine the reverse complement of what the adapter sequence is vs. what is actually in the strand? They sent me a demultiplex file that has every same, the sequencing info and file name, and then the i7 and i5 'barcode' for each sequence.

brantfaircloth commented 1 year ago

It’s up to you to figure out how the index sequence is likely to be in your reads (as adapter contamination) so you can trim it. The Adapterama file (if you used our indexes) contains forward and reverse complements of all adapters. You can also skip searching for the index altogether (as I mentioned above), but you’ll have to trim all your files manually (not hard with some bash scripting).

jbernst commented 1 year ago

I think I am close to figuring this out. I spoke to the sequencing company and there are reverse complement orientations to my index barcodes. I just checked my uce alignments post-edge and post-internal trimming, and only one single hit for barcode was found. Given this, I think it is safe for me to just use sed or manually remove them, right? I was thinking of doing searched and removing the following:

>i5
AATGATACGGCGACCACCGAGATCTACAC*[reverse complement index]*ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>i7
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*[reverse complement index]*ATCTCGTATGCCGTCTTCTGCTT

I will also remove the left and right sequences around the adapter. And also doing the reverse complement of the barcodes themselves (I checked, and there are indeed entire adapters in there of the reverse complement).

So when you mention bash scripting, does sed sound like an ideal way for me to go about this? On individual alignments? I just want to make sure you don't think I should do this on a part prior to the alignments and cleaning of alignments.

Also, here is what the sequencing company told me. I am posting this information here in case anyone ever has a similar scenario I am having. I got my sequences through Arbor Biosciences, so I cannot speak for other companies:

Different Illumina instruments use different workflows. The Forward Strand Workflow requires the i5 in the forward orientation and the Reverse Complement Workflow requires the reverse complemented i5. The NovaSeq v1 chemistry (pre-November 2020) used the Forward Strand Workflow. The v1.5 chemistry was released in November of 2020, and so the NovaSeq was switched to the Reverse Complement Workflow. Some sequencing facilities still require that the indexes are provided in the Forward Strand orientation because they do a screen on the MiSeq first, which uses the Forward Strand Workflow. So, the indexes that are provided are in the Forward Strand Workflow orientation. Link to workflows here.

brantfaircloth commented 1 year ago

The only thing that will make the approach you propose tricky is that the adapters + barcodes are not always sequenced perfectly... so you can miss matches that contain errors (sed, grep, and others search for "perfect" matches.

There are several ways around this, but one way to work around would be to trim your contigs for adapter sequences prior to aligning them. You could possibly do this with something like trimmomatic, although a tool meant for longer sequences (like poretools or similar) might be better. All of that said, appropriately removing adapters from the starting reads is still likely the most optimal way forward.

The information Arbor provided for Illumina sequences is pretty standard stuff - different instruments (even from Illumina) work in slightly different ways, so you need to know what to look for in order to trim it correctly. You also need to have some knowledge of how the adapter sequences work and what actually gets sequenced, in what order (we provide lots of figures in our Adapterama paper, as well as some history of different adapter systems).

What I do is to always search the reads I receive (R1 and R2) for

ACACTCTTTCCCTACACGACGCTCTTCCGATCT
and
ATCTCGTATGCCGTCTTCTGCTT

Which are the 3' ends of each adapter sequence. Then, I look to see which index sequence is adjacent to those sequences and how that index sequence compares to the index sequence I expect. Once I know if I need to use the forward or reverse complement of the index I expect, I construct the config file for Illumiprocessor. After that, I've not had problems with adapter contamination.

jbernst commented 1 year ago

That's really helpful. I think the way for me to go about this and really ensure I do this correctly is to start this over and do trimmomatic using the reverse complements (which are clearly in the sequences). One last question:

In the config file, you insert something like:

 [adapters]
i7:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTT
i5:AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGATCT

Can I insert more lines there to take into account reverse complements? Or would phyluce throw an error at me (so there would be four time, in which the reverse complement is also given for i7 and i5).

And I suppose the would go for the lines where you give the barcode index sequences for each sequences (4 lines, forward i7, forward i5, reverse complement i7, reverse complement i5).

brantfaircloth commented 1 year ago

You cannot include revcomps (for the entire adapter) when using illumiprocessor (I.e., you can only include 2 sequences). I'm also pretty sure that you do not need to bother with 4, given how whole genome libraries are prepared and sequenced on Illumina instruments (amplicons sequenced on Illumina are a slightly different story).

That said, you can include whatever adapter sequence(s) you like when using trimmomatic outside of Illumiprocessor, although more sequences searched for may slow things down a bit.

jbernst commented 1 year ago

Okay, I will just do a search and rerun everything using the correct orientation. I think at this point we can close this (but if you prefer me to report back and leave this open, I can do that too). Thanks again for all the suggestions!

brantfaircloth commented 1 year ago

I'll go ahead and close - but you can still add comments to a closed issue.

jbernst commented 1 year ago

I am just adding a comment in here to state that the reverse complementing of the i5 adapter and the associated i5 barcodes worked. I have checked and there are no hits for searching for any of the adapters or respective reverse complements.

Thank you!