OpenGene / fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
MIT License

add adapters sequences from BGI/MGI sequencing data to built-in adapters #259

Open guidohooiveld opened 4 years ago

guidohooiveld commented 4 years ago

Hi. I noticed that on the SEQanswers forum a document from BGI has been posted that lists all sequences for the oligos and primers used for BGISEQ/DNBSEQ/MGISEQ library preparation. See here for the thread (2nd post).

On page 7:

The following sequences are used to filter the adapter contamination in raw data.
Forward filter:  AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA
Reverse filter:  AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG

Could these 2 (or maybe all listed) sequences be added to the set of built-in adapters fastp uses?

Thanks, Guido

sfchen commented 4 years ago

Ok, I will add them.

sfchen commented 4 years ago

After a search, I cannot confirm that these two sequences are BGI-Seq adapters.

I will contact the BGI-Seq team to get their official adapter sequences, and update fastp accordingly.

guidohooiveld commented 4 years ago

Great, thanks for your willingness to do this! BTW, out of curiosity, how did you check this, and why were you unable to confirm it?

sfchen commented 4 years ago

I have received a response from the BGI team; they will send me the adapter list in a couple of days.

I will then update the built-in adapters and release a new fastp version.

guidohooiveld commented 4 years ago

Just curious: was the BGI team able to provide the adapter sequences?

Shellfishgene commented 4 years ago

Any update on this? I also just received my first BGISeq data.

guidohooiveld commented 3 years ago

Kind reminder; I am about to receive another BGISeq data set. Thanks!

sfchen commented 3 years ago

Hi, I just got the sequences from MGI. I will update the built-in adapter sequences.

sfchen commented 3 years ago

I just added the MGI/BGI adapter sequences to the known adapters:

knownAdapters["AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"] = ">MGI/BGI adapter (forward)";
knownAdapters["AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"] = ">MGI/BGI adapter (reverse)";

Could you please try the latest build, or use the latest prebuilt binary?

If you can upload a small MGI/BGI dataset, I can also give it a try.
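For illustration only, a lookup against such a table could look like the sketch below. This is not fastp's actual detection code: `makeKnownAdapters`, `findKnownAdapter`, and the `seedLen` parameter are hypothetical names, and the exact-match seed search is an assumption chosen for brevity.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not part of fastp): a small table keyed by adapter
// sequence, holding only the two MGI/BGI entries quoted in this thread.
std::map<std::string, std::string> makeKnownAdapters() {
    std::map<std::string, std::string> knownAdapters;
    knownAdapters["AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"] = ">MGI/BGI adapter (forward)";
    knownAdapters["AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"] = ">MGI/BGI adapter (reverse)";
    return knownAdapters;
}

// Hypothetical helper: return the label of the first adapter whose leading
// `seedLen` bases occur exactly somewhere in `read`, or "" if none does.
// A real tool would tolerate mismatches; this sketch does not.
std::string findKnownAdapter(const std::string& read,
                             const std::map<std::string, std::string>& adapters,
                             std::size_t seedLen = 12) {
    for (const auto& entry : adapters) {
        const std::string seed = entry.first.substr(0, seedLen);
        if (read.find(seed) != std::string::npos) {
            return entry.second;
        }
    }
    return "";
}
```

Note that the two sequences share their first 8 bases (`AAGTCGGA`), so a seed shorter than 9 bases could not tell the forward and reverse adapters apart.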

guidohooiveld commented 3 years ago

Sorry for my delayed reply. I used the latest version on GitHub (0.21) and compared its results with those of the previous version (0.20.1). To my surprise, the results were exactly the same. Is this expected, even if adapter trimming was likely already done by BGI? I would still have expected some BGI adapters to be found and trimmed now that they are specifically searched for, so that the results of the two versions would differ at least slightly (at least in the number of bases trimmed due to adapters).

Filtering result:
reads passed filter: 43562268
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads failed due to low complexity: 2182
reads with adapter trimmed: 2837340
bases trimmed due to adapters: 14182202
Adapter or bad ligation of read1
The input has little adapter percentage (~0.217030%), probably it's trimmed before.
Adapter or bad ligation of read2
The input has little adapter percentage (~0.217030%), probably it's trimmed before.

fastp run command: fastp --in1 ./TEST_IN/RNA-1/RNA-1_1.fq.gz --in2 ./TEST_IN/RNA-1/RNA-1_2.fq.gz --out1=./TEST_OUT/RNA-1/RNA-1_1.fq.gz --out2=./TEST_OUT/RNA-1/RNA-1_2.fq.gz --low_complexity_filter --thread=16 --json ./TEST_OUT/RNA-1/RNA-1.fastp.json --html ./TEST_OUT/RNA-1/RNA-1.fastp.html

sfchen commented 3 years ago

Since your data is paired-end, fastp can trim the adapters without an adapter sequence being provided. So it already worked before.

guidohooiveld commented 3 years ago

Aha, I got it. I was a little confused; I assumed that since the adapter sequence auto-detection is disabled by default for PE data, adapter detection overlap analysis would also be disabled. However, I now understand that these are 2 separate processes, and that for PE data the latter (= adapter detection by per-read overlap analysis) is always occurring (and apparently cannot be disabled). Hence, results between versions are identical...
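To make the overlap idea concrete, here is a minimal sketch of how adapter read-through can be found in a read pair without knowing the adapter sequence. It is not fastp's implementation: `reverseComplement`, `inferShortInsert`, and the `minOverlap` and `maxMismatch` defaults are assumptions made for this example. The reasoning is that when the insert is shorter than the read length, read 1 starts with the insert and the reverse complement of read 2 ends with it, so everything past the inferred insert length in either read must be adapter.

```cpp
#include <algorithm>
#include <string>

// Reverse-complement a DNA sequence; any non-ACGT base becomes 'N'.
std::string reverseComplement(const std::string& seq) {
    std::string rc(seq.rbegin(), seq.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            default:  c = 'N'; break;
        }
    }
    return rc;
}

// Hypothetical helper: return the inferred insert length if it is shorter
// than the shorter read (meaning bases beyond it are adapter), else -1.
// Candidate lengths are tried from longest to shortest.
int inferShortInsert(const std::string& r1, const std::string& r2,
                     int minOverlap = 10, int maxMismatch = 1) {
    const std::string rc2 = reverseComplement(r2);
    const int len1 = static_cast<int>(r1.size());
    const int len2 = static_cast<int>(rc2.size());
    for (int L = std::min(len1, len2) - 1; L >= minOverlap; --L) {
        int mismatches = 0;
        // Compare the first L bases of r1 with the last L bases of rc2.
        for (int i = 0; i < L && mismatches <= maxMismatch; ++i) {
            if (r1[i] != rc2[len2 - L + i]) {
                ++mismatches;
            }
        }
        if (mismatches <= maxMismatch) {
            return L;  // both reads can be trimmed to L bases
        }
    }
    return -1;
}
```

If the function returns a length `L`, both reads would be cut to their first `L` bases. Because this never consults an adapter list, adding new entries to the built-in adapters would not change its result, which is consistent with the identical output of the two versions.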