Weeks-UNC / shapemapper2

Public repository for ShapeMapper 2 releases

Raw fastq files processing #38

Closed angelika888 closed 1 year ago

angelika888 commented 1 year ago

Hello!

I'd like to ask whether you recommend any pre-processing steps for fastq files before running ShapeMapper2. The sequencing data come from the amplicon workflow of SHAPE-MaP, where the PCR steps (amplicon PCR and limited-cycle PCR during library construction) total no more than 30 cycles and adapters were added by ligation. The adapter sequences have already been trimmed from the fastq files. I would be grateful for your advice.

Regards, Angelika

Psirving commented 1 year ago

Hi Angelika, ShapeMapper2 includes all of the necessary steps to process raw fastq and fastq.gz files. That being said, a good first step in any sequencing experiment is to run a quality control program such as FastQC. One thing to pay particular attention to is the duplication level. Some amount of duplication is expected, but too much can be a problem if you are doing RING-MaP, PAIR-MaP, or DANCE-MaP downstream. Duplication doesn't seem to affect ShapeMapper results very much.

angelika888 commented 1 year ago

Ok, thank you for your answer. I've checked the duplication level and it's very high, but I've also read that this FastQC module will issue an error if non-unique sequences make up more than 50% of the total. So if my library consists of, for example, a single amplicon, that alone could trigger the error.

Psirving commented 1 year ago

You're correct that FastQC will flag amplicon data for duplication levels. Amplicon experiments are hard to pin down for duplication. The expected duplication level (the number of times we expect to see identical sequences with no PCR duplication) depends on read length, mutation frequency, and total read count. We don't have a particular number we are aiming for, but untreated samples should be very highly "duplicated" because they are mostly identical, while modified samples should be less "duplicated" because they contain somewhat random mutations throughout each read.
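To make the point about read length and mutation frequency concrete, here is a minimal sketch. It assumes a single amplicon, full-length reads, and mutations that occur independently at one uniform per-nucleotide rate, which is a simplification. The rates and read length are hypothetical values chosen for illustration, not measurements from ShapeMapper.

```python
def fraction_unmutated(read_length: int, mutation_rate: float) -> float:
    """Probability that a read of `read_length` nt carries no mutation,
    assuming independent mutations at a uniform per-nucleotide rate."""
    return (1.0 - mutation_rate) ** read_length


# Hypothetical values, for illustration only.
READ_LENGTH = 200
untreated = fraction_unmutated(READ_LENGTH, 0.001)  # low background rate
modified = fraction_unmutated(READ_LENGTH, 0.02)    # higher rate after treatment

print(f"untreated: {untreated:.2f} of reads are mutation-free")
print(f"modified:  {modified:.2f} of reads are mutation-free")
```

Under these assumptions, mutation-free reads from one amplicon are all the same sequence, so a quality control tool counts them as duplicates even when no PCR duplication occurred. With the values above, most untreated reads fall in that group while only a few percent of modified reads do.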

angelika888 commented 1 year ago

Thanks for clarifying. So, from these data, we don't know the real duplication level, but as you said, duplication doesn't seem to affect ShapeMapper results very much. But what if I'd also like to use these data for DANCE-MaP analysis? Is that possible?

Psirving commented 1 year ago

That's correct. You can definitely use these data for DANCE-MaP as long as they are DMS-treated. SHAPE reagents don't work as well with DANCE-MaP. For reliable DANCE-MaP results you want high mutations per read, long reads, and a lot of them: about 1M reads for fitting reactivity profiles and about 3M for subsequent RING/PAIR-MaP data. When running ShapeMapper, use the --per-read-histogram flag. This will produce a table in shapemapper_log.txt that contains a histogram of mutation counts per read and read lengths. You want most of your reads to be full-length, as DANCE-MaP filters out any non-full-length reads. You want the peak of the mutations-per-read histogram to be in the 5-8 range, depending on your read lengths (longer reads have more mutations).
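The rules of thumb above can be turned into a quick check. This is only a sketch: the function name `summarize_histogram` and the example counts are made up, and it takes the histogram as a plain dictionary because the layout of the table in shapemapper_log.txt is not described in this thread. The 1M and 3M cutoffs are the approximate figures quoted above, not hard limits.

```python
from typing import Dict


def summarize_histogram(mutation_counts: Dict[int, int]) -> dict:
    """Summarize a mutations-per-read histogram.

    `mutation_counts` maps "mutations in a read" -> "number of reads".
    """
    total_reads = sum(mutation_counts.values())
    # The peak is the mutation count observed in the most reads.
    peak = max(mutation_counts, key=mutation_counts.get)
    return {
        "total_reads": total_reads,
        "peak_mutations_per_read": peak,
        "peak_in_5_to_8": 5 <= peak <= 8,
        # Approximate read-count guidance from the thread.
        "enough_for_profiles": total_reads >= 1_000_000,
        "enough_for_ring_pair": total_reads >= 3_000_000,
    }


# Made-up histogram, for illustration only.
example = {3: 200_000, 4: 300_000, 5: 400_000, 6: 450_000,
           7: 350_000, 8: 200_000, 9: 100_000}
summary = summarize_histogram(example)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For this made-up histogram the peak sits at 6 mutations per read and the total is 2M reads, so it would pass the profile-fitting guidance but fall short of the RING/PAIR-MaP one.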

Psirving commented 1 year ago

Re: duplication. This is just hard to diagnose for amplicon data. It is usually a problem when your cDNA input copy number for PCR1 is very low. Sometimes, to diagnose this, we run deduping software on the reads and then perform two ShapeMapper runs, one with deduping and one without. If the results after DANCE-MaP look much better with deduping (VERY subjective), then we say that duplication was a problem and try the experiment again with more input cDNA. For your case, I would just assume duplication is not an issue and move forward. If your results look really weird, revisit duplication as a possible source of the problem.

angelika888 commented 1 year ago

Ok, thank you for the detailed answer and help! One last question: do you recommend any software for the deduplication step? And I wish you a Merry Christmas! :)

Psirving commented 1 year ago

Nubeam-dedupe or BBTools dedupe. You’ll want to remove the 5 Ns that you added to PCR1 primers. For that I use umi-tools. Merry Christmas!
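As a rough illustration of what removing the 5 Ns involves, the sketch below trims the first five bases from a read and keeps them in the read name. It is not how umi-tools is implemented, and it assumes the random bases sit at the very start of the read, which depends on your primer design. The function name `extract_umi` and the example record are hypothetical; for real data, use a dedicated tool as recommended above.

```python
def extract_umi(header: str, seq: str, qual: str, umi_len: int = 5):
    """Move the first `umi_len` bases of a read into its name.

    Assumes the random primer bases are at the 5' end of the read.
    Returns the new header, trimmed sequence, and trimmed qualities.
    """
    umi = seq[:umi_len]
    # Keep any comment after the first space in the header untouched.
    name, _, comment = header.partition(" ")
    new_header = f"{name}_{umi}" + (f" {comment}" if comment else "")
    return new_header, seq[umi_len:], qual[umi_len:]


# Made-up fastq record, for illustration only.
record = extract_umi("@read1 1:N:0", "ACGTTGGCA", "IIIIIFFFF")
print(record)
```

Keeping the trimmed bases in the read name means they are still available to a deduplication step, while the sequence passed on for alignment no longer starts with random bases.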

angelika888 commented 1 year ago

Thank you!