a-ludi / dentist

Close assembly gaps using long-reads at high accuracy.
https://a-ludi.github.io/dentist/
MIT License

Questions about Input Reads #9

Closed: cassondranewman closed 4 years ago

cassondranewman commented 4 years ago

Hello, I have two questions regarding input reads:

1. Is it recommended to do error correction on the PB long reads before plugging them into dentist?

2. Is there a maximum coverage recommended for the input reads? Your example is 25x, so I was just wondering if there is an upper limit.

Thanks!

a-ludi commented 4 years ago

Regarding 1

It is actually recommended not to error-correct the reads, because error correction is a very compute-intensive process. Depending on the method, error correction may also introduce new artifacts, which might then get integrated into the closed gaps.

Nevertheless, if you already have error-corrected reads from the assembly process, or if you have HiFi reads, you can use them too. You will have to adjust all the calls to daligner/damapper, though: the parameters -k, -%, -w and -h need to be adjusted if you significantly increase -e. At the moment, I cannot give concrete advice on any particular configuration. Note that the development branch of DENTIST includes major changes to the Snakemake workflow, particularly to the way alignment (daligner/damapper) flags are handled, so you might want to compile DENTIST yourself to avoid later adjustments.

Regarding 2

No, there is no hard limit on the read coverage, neither lower nor upper. However, experiments suggest a minimum read coverage of 15x. More than 40x seems to have very little effect on either sensitivity or accuracy. I would recommend downsampling the read set if its coverage is much higher than 40x (e.g. 80x or more) because of the runtime implications.
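As a rough illustration of these thresholds, here is a minimal Python sketch that computes read coverage as total read bases divided by genome size and maps it onto the 15x / 40x / 80x guidance. The function names and the toy numbers are hypothetical and not part of DENTIST; the thresholds are only the rules of thumb mentioned in this comment.

```python
def coverage(read_lengths, genome_size):
    """Return read coverage: total read bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def coverage_advice(cov, minimum=15.0, plateau=40.0, downsample_at=80.0):
    """Map a coverage value onto the rules of thumb from this thread.

    The default thresholds are assumptions taken from the comment above,
    not hard limits enforced by DENTIST.
    """
    if cov < minimum:
        return "below suggested minimum"
    if cov <= plateau:
        return "ok"
    if cov < downsample_at:
        return "little extra benefit"
    return "consider downsampling"


if __name__ == "__main__":
    # Toy data: 300 reads of 10 kbp against a 100 kbp genome.
    lengths = [10_000] * 300
    cov = coverage(lengths, genome_size=100_000)
    print(f"{cov:.1f}x: {coverage_advice(cov)}")
```

In practice the read lengths would come from your own read set, and the genome size would be the total length of the assembly or an independent estimate.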

cassondranewman commented 4 years ago

Thank you!

cassondranewman commented 4 years ago

One follow up question about downsampling:

To downsample, would you recommend taking a random sample of all reads, or keeping only reads longer than X (i.e. only the longest reads)?

I fed a subsample of the longer reads (greater than length X) into the assembly pipeline, so I am not sure if I should do the same here.

a-ludi commented 4 years ago

Right now I see no good reason to keep "short" reads. So I think a good strategy is to increase the minimum read length x, as in DBsplit -x, until you are satisfied with the amount of read data (check with DBstats -n).

a-ludi commented 4 years ago

PS: You can get a good indication of which -x to choose by running DBstats on the reads database.
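To make the idea concrete, here is a minimal Python sketch of how a length cutoff could be picked from a list of read lengths so that the longest reads add up to a target coverage. It is not part of DENTIST or DAZZ_DB; the function names and the toy numbers are hypothetical, and the resulting value is only a starting point for DBsplit -x.

```python
def length_cutoff(read_lengths, genome_size, target_coverage):
    """Return the length of the shortest read needed to reach the target.

    Reads are taken longest first until their summed length reaches
    target_coverage * genome_size. Returns 0 (keep everything) if even
    the full read set stays below the target.
    """
    target_bases = target_coverage * genome_size
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    return 0


def retained_coverage(read_lengths, genome_size, cutoff):
    """Coverage left after keeping only reads with length >= cutoff."""
    return sum(n for n in read_lengths if n >= cutoff) / genome_size


if __name__ == "__main__":
    # Toy data: ten reads each of 1, 5 and 10 kbp against a 10 kbp genome.
    lengths = [1_000] * 10 + [5_000] * 10 + [10_000] * 10
    x = length_cutoff(lengths, genome_size=10_000, target_coverage=12)
    print(x, retained_coverage(lengths, 10_000, x))
```

Because every read of exactly the cutoff length is kept, the retained coverage can end up somewhat above the target, so it is worth checking the result against the DBstats output afterwards.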

cassondranewman commented 4 years ago

Ok! Thanks again