Questions for the adapterSeq parameter

weiwsmiling commented 3 years ago

Hi,

Thanks for the nice pipeline! I am trying to use it to analyze data similar to Duplex-Seq, and I am not sure I understand which adapterSeq I should provide. My reads consist of Adapter+UMI+spacer sequences, from left to right, in read1.fq. The example in the README is "ANNNNNNNNAGATCGGAAGAG", which I think is spacer+UMI+Adapter. Should I put the reverse complement of my Adapter+UMI+spacer sequence in the config CSV file?

Thanks, Wei

bkohrn commented 3 years ago

Hi Wei,

The adapterSeq parameter is being passed to cutadapt; you can see cutadapt's documentation for more information (https://cutadapt.readthedocs.io/en/stable/). Depending on your adapter setup, you might want to use different adapters; the adapter is generally found at the end of the read (where the read has run into the adapter region of the opposing read).

Your instinct is correct; this is the spacer sequence (in this case, just the A-tailing), followed by the UMI (in this case, 8 bp of random sequence), and then the adapter sequence (which for us is generally the standard Illumina adapter). In some cases, you may have multiple potential UMIs that you know, in which case you can provide a .fasta file of your potential UMIs. The adapter is not reverse-complemented (I actually got it from one of the back-end files in FastQC), but the inclusion of the UMI and spacer sequences is important, as otherwise these can cause false mutations from alignment of part of the UMI to your reference.
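To make the layout described above concrete, here is a minimal Python sketch that assembles an adapterSeq string as spacer + UMI placeholder + adapter. The function name `build_adapter_seq` is hypothetical and not part of the pipeline; the values in the example are the ones from the README string (an "A" spacer, 8 random bases, and the adapter prefix AGATCGGAAGAG).

```python
def build_adapter_seq(spacer: str, umi_length: int, adapter: str) -> str:
    """Concatenate spacer, an N placeholder for each UMI base, and the adapter.

    Hypothetical helper for illustration only; it is not part of the pipeline.
    """
    if umi_length < 0:
        raise ValueError("umi_length must be non-negative")
    # Each random UMI position is written as the wildcard base N.
    return spacer.upper() + "N" * umi_length + adapter.upper()


# Reproduces the README example: A-tail spacer, 8 bp random UMI, adapter prefix.
readme_example = build_adapter_seq("A", 8, "AGATCGGAAGAG")
print(readme_example)
```

For a different library design, only the three arguments would change; whether a long or low-quality spacer needs to be given exactly is a separate question that this sketch does not answer.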

One note: this pipeline will remove UMI information from both ends of the read, so if you only have UMI information in one read, you might lose some data when the pipeline extracts UMI information from the other read.

Hope that helps

Edit: If you do know what your UMI is (or what the options are), I'd suggest submitting a fasta file with both the reverse complement and normal versions, as:

A-UMI-Adapter
A-reverse_complement_UMI-Adapter
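As a rough sketch of that suggestion, the Python below writes one FASTA record for each known UMI in its normal orientation and one with the UMI reverse-complemented, each flanked by the spacer and adapter. The function names, record names, and the UMI sequences in the example are all made up for illustration; the spacer and adapter defaults are taken from the README example, and how the resulting file is referenced in the config is not covered here.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def adapter_fasta(umis, spacer="A", adapter="AGATCGGAAGAG"):
    """Build FASTA text with spacer-UMI-adapter in both UMI orientations.

    Hypothetical helper; the record names (umi1_fwd, umi1_rc, ...) are arbitrary.
    """
    lines = []
    for i, umi in enumerate(umis, start=1):
        for label, oriented in (("fwd", umi), ("rc", reverse_complement(umi))):
            lines.append(f">umi{i}_{label}")
            # Only the UMI is reverse-complemented; spacer and adapter are kept as given.
            lines.append(spacer + oriented + adapter)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example UMIs are placeholders, not real barcodes from any kit.
    print(adapter_fasta(["AACGTTGA", "CCTAGGTA"]), end="")
```

Redirecting the printed output to a file (for example `adapters.fasta`, a hypothetical name) gives a file in the layout described above.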

weiwsmiling commented 3 years ago

Thank you very much for the detailed information! For the spacer sequence, does it need to be provided precisely if it is not an A-tail? For example, in the Nature Protocols paper it is TGACT. My spacer is a 19 bp low-complexity sequence with poor sequence quality. I appreciate your help!

Best, Wei

Kennedy-Lab-UW / Duplex-Seq-Pipeline

Questions for the adapterSeq parameter #98