Sequencing depth for non-human sample

AlbatrosHurleur commented 4 years ago

Hello,

I wonder if there are guidelines on how many reads (or pairs) to target when sequencing non-human Cut&Run experiments.

I'm working on a diatom species (~28 Mbp genome), and 10 million PE-150 reads would leave very few zero-coverage regions (~54x coverage on average), leading to a very high background in the control sample. The same 10 million PE-150 reads represent only around 0.5x average coverage of a 3.2 Gbp human genome. If I scale linearly, that would mean only ~90K reads for my diatom.
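The arithmetic in this comment can be reproduced with a short Python sketch. It assumes the figures count 10 million individual 150 bp reads (not 10 million pairs, which would double the coverage); the function and variable names are illustrative only.

```python
def mean_coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_len / genome_size


def reads_for_coverage(target_cov: float, read_len: int, genome_size: int) -> int:
    """Number of reads of a given length needed to reach a target average depth."""
    return round(target_cov * genome_size / read_len)


DIATOM = 28_000_000       # ~28 Mbp, as stated in the post
HUMAN = 3_200_000_000     # ~3.2 Gbp, as stated in the post

# Assumption: 10 million individual 150 bp reads.
diatom_cov = mean_coverage(10_000_000, 150, DIATOM)   # about 54x
human_cov = mean_coverage(10_000_000, 150, HUMAN)     # about 0.5x

# Scaling the read count linearly with genome size.
scaled_reads = round(10_000_000 * DIATOM / HUMAN)     # about 90K reads

print(f"{diatom_cov:.1f}x, {human_cov:.2f}x, {scaled_reads} reads")
```

Calling `reads_for_coverage(0.5, 150, DIATOM)` gives a figure in the same range, since 0.5x is itself a rounded value.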

As SEACR is sensitive to high background, would you advise targeting the same value on much smaller genomes, or does it not really matter as long as the signal-to-noise ratio is roughly the same? Also, is it advisable to sequence deeper when using single-end reads?

Thank you

mpmeers commented 4 years ago

Hi,

SEACR will not function properly under the conditions you describe, since it will consider the broad 54x coverage to be a single "signal block" and lose the resolution of enrichment. This is a known weakness that I am interested in addressing, but unfortunately I haven't had time to do so yet.

To use SEACR, it would be ideal to sequence at a depth that approximates 0.5x coverage of your genome, although I acknowledge it may be difficult to collect enough samples to do so on a high-capacity flow cell. Alternatives might include sub-sampling your reads accordingly after sequencing and mapping, or using an alternative peak caller such as MACS2, which primarily considers local enrichment. If I come up with a fix for this issue in SEACR, I will let you know.
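As a rough illustration of the sub-sampling option, the sketch below computes the fraction of read pairs to keep to go from ~54x to ~0.5x, and decides per pair from a hash of the read name so that both mates are always kept or dropped together. This is only one possible approach, not something SEACR provides; the function names, the seed handling, and the `pair…` read names are hypothetical.

```python
import hashlib


def keep_fraction(current_cov: float, target_cov: float) -> float:
    """Fraction of read pairs to retain to reach the target average coverage."""
    return min(1.0, target_cov / current_cov)


def keep_pair(read_name: str, fraction: float, seed: int = 0) -> bool:
    """Decide from the read name alone, so both mates share one decision."""
    digest = hashlib.sha256(f"{seed}:{read_name}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-uniform in [0, 1)
    return u < fraction


fraction = keep_fraction(54.0, 0.5)  # a little under 1% of pairs

# Hypothetical read names standing in for the names in a mapped library.
names = [f"pair{i}" for i in range(100_000)]
kept = [n for n in names if keep_pair(n, fraction, seed=1)]
print(f"fraction={fraction:.4f}, kept {len(kept)} of {len(names)} pairs")
```

Changing `seed` gives a different but reproducible subset of the same expected size.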

Regarding single-end reads: SEACR is designed for data derived from read pairs. In single-end data, the global background will be relatively non-complex at the low end, in the sense that it will be "quantized" by the uniform length of the reads. This can feed back into the global thresholding and may cause issues, so I recommend using paired-end data, though I have not systematically tested the extent to which single-end data does or does not work well.

Mike

AlbatrosHurleur commented 4 years ago

Hi Mike,

Thanks a lot for your clear answers. We'll target a lower sequencing depth, with paired-end reads; I'll check with the wet lab team how they prefer to manage this. I had also thought about sub-sampling if I still get too much background coverage; several passes could act as "in silico replicates", and MACS2 could be run on the whole library to compare results.

I am also wondering to what extent read pairs that overlap each other (e.g. PE150 on ~200 bp DNA fragments) would add coverage bias and impair SEACR results. If they are a problem, is using PE50 reads an alternative, even though ambiguous mapping would increase?

Thanks again!

mpmeers commented 4 years ago

Hi,

Overlap should not be an issue as long as adapters are properly trimmed from reads, although this can introduce its own challenges. We typically sequence only 25 bp paired-end reads in order to entirely avoid complications with trimming adapters. There may be reasons to sequence longer reads (e.g. high density of repeat regions), but given the likelihood that the majority of your fragments will be shorter than 200 bp, I'd recommend using shorter read lengths where possible, all else being equal. However, if there is a compelling reason for you to use 150 bp PE reads, my sense is that SEACR's performance should not be affected so long as the reads can be trimmed of adapters and coherently assembled into paired-end fragments.
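To make the read-length trade-off concrete, here is a minimal Python sketch of how many adapter bases a read runs into, and how many bases both mates cover, for a given read and fragment length. It assumes an idealized pair with one mate starting at each end of the fragment; the fragment lengths used are hypothetical examples, not measured values.

```python
def adapter_bases_read(read_len: int, fragment_len: int) -> int:
    """Bases of adapter sequenced when the read is longer than the fragment."""
    return max(0, read_len - fragment_len)


def mate_overlap(read_len: int, fragment_len: int) -> int:
    """Fragment bases covered by both mates of an idealized pair."""
    return max(0, min(fragment_len, 2 * read_len - fragment_len))


# Hypothetical fragment lengths paired with the read lengths from the thread.
for read_len, fragment_len in [(150, 200), (150, 120), (50, 120), (25, 120)]:
    print(
        f"PE{read_len} on a {fragment_len} bp fragment: "
        f"{adapter_bases_read(read_len, fragment_len)} adapter bases per read, "
        f"{mate_overlap(read_len, fragment_len)} bp covered by both mates"
    )
```

Under these assumptions, short reads never reach the adapter on typical fragments, which is why trimming matters mainly for the longer read lengths.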

Mike

AlbatrosHurleur commented 4 years ago

Hi Mike, thanks a lot for your advice; I'll keep you posted on how it goes. Also, are the adapters that may be left in the library one of the reasons why you typically perform local alignment mapping?

mpmeers commented 4 years ago

Hi,

We actually typically do end-to-end mapping with bowtie2, using the following parameters:

--end-to-end --very-sensitive --no-mixed --no-discordant -q --phred33 -I 10 -X 700

The fact that we sequence such short reads enables us to get away with no real adapter trimming while still doing end-to-end mapping. Perhaps local mapping would help you with 150 bp reads, but I haven't tested that systematically.
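For anyone scripting this, the parameters above could be assembled into a full command along the following lines. This is a sketch only: the index prefix and file names are hypothetical placeholders, and `-x`, `-1`, `-2` and `-S` are the standard bowtie2 options for the index, the two mate files and the SAM output.

```python
import shlex

# Mapping parameters exactly as quoted in the comment above.
MAPPING_FLAGS = [
    "--end-to-end", "--very-sensitive", "--no-mixed", "--no-discordant",
    "-q", "--phred33", "-I", "10", "-X", "700",
]


def bowtie2_command(index: str, r1: str, r2: str, out_sam: str) -> list[str]:
    """Build the argument list for a paired-end bowtie2 run."""
    return ["bowtie2", *MAPPING_FLAGS, "-x", index, "-1", r1, "-2", r2, "-S", out_sam]


# Hypothetical index prefix and file names, for illustration only.
cmd = bowtie2_command(
    "diatom_index", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample.sam"
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```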

Mike

AlbatrosHurleur commented 4 years ago

Hi Mike,

Thanks, we'll do PE50 reads and end-to-end mapping. I got confused about that because some mapping examples show the --local flag, but end-to-end should be better when possible. Now I'm looking forward to getting the data!

Best regards

AlbatrosHurleur commented 3 years ago

Hi Mike,

To follow up on this topic, we did sequence our Cut&Run experiment with PE150, as it was less expensive (1GB of data, the smallest amount we could order). I ran the SEACR analysis after trimming to PE50 and drawing 30 different random sub-samples of 300K properly paired reads (~0.5x coverage of our species); I also ran it at 0.5x with full-length reads, and ran MACS2 on the full PE150 library (~50x). Unfortunately, the results are not meaningful, as it seems the IP library was not enriched at our histone mark loci for some reason. We may try this again if the wet lab colleagues find out what went wrong; I'll keep you informed if it works next time.
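The repeated sub-sampling described here can be sketched in Python as follows. This is not the script actually used; it only illustrates drawing several reproducible random subsets with distinct seeds, using toy sizes and hypothetical names in place of the real library.

```python
import random


def subsample_replicates(pair_ids, n_keep, n_replicates, base_seed=0):
    """Draw n_replicates random subsets of n_keep IDs, one seed per replicate."""
    replicates = []
    for i in range(n_replicates):
        rng = random.Random(base_seed + i)  # distinct, reproducible seed
        replicates.append(sorted(rng.sample(pair_ids, n_keep)))
    return replicates


# Coverage check for the figures in the post: 300K reads of 50 bp on ~28 Mbp.
coverage = 300_000 * 50 / 28_000_000  # about 0.5x

# Toy stand-in for the IDs of properly paired reads in the full library.
pair_ids = list(range(10_000))
replicates = subsample_replicates(pair_ids, n_keep=300, n_replicates=30)
print(f"{len(replicates)} replicates of {len(replicates[0])} IDs, ~{coverage:.2f}x each")
```

Each replicate would then be converted to its own bedgraph and run through the peak caller separately, so that calls can be compared across replicates.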

Best regards

mpmeers commented 3 years ago

Hi,

Thanks for following up--please keep me updated if you make any progress, and I will post here if I come up with a fix that will allow for using deeper data of the sort that you have.

Mike

FredHutch / SEACR

Sequencing depth for non-human sample #41