ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Digital kmer-based normalization #97

Closed · taltman closed this issue 4 years ago

taltman commented 4 years ago

This is an approach to identify k-mers that are disproportionately represented in the reads, which can cause problems for the assembler. Once identified, they can be thinned out.
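
As a concrete illustration, khmer's normalize-by-median.py (one of the tools discussed below) implements this idea; a minimal sketch, assuming khmer is installed and the input is single-end or interleaved FASTQ (the k-mer size and coverage cutoff here are illustrative placeholders, not tuned values):

    # keep reads only while their median k-mer coverage is below the cutoff (assumed parameters)
    normalize-by-median.py -k 21 -C 20 -o reads.norm.fq reads.fq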

Once Ryan is on board, I will assign this to him.

taltman commented 4 years ago

/cc @JustinChu

RyloByte commented 4 years ago

Here are the tools I'm looking into:

  - khmer (https://github.com/dib-lab/khmer)
  - kmernorm (http://sourceforge.net/projects/kmernorm)
  - ORNA (https://github.com/SchulzLab/ORNA)

Are there specific transcriptomes I should focus on for this? Otherwise I have some of my own that I can use as testers.

My loose plan is:

  1. assemble transcriptomes without normalization
  2. run transcriptomes through one or more of the above tools
  3. assemble the normalized transcriptomes
  4. use a tool like metaQUAST to assess the quality of the assemblies (see the sketch after this list)
  5. monitor the resource usage for each trial (any suggestions on how I might do this?)
  6. report back with findings
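
For step 4, a minimal metaQUAST sketch for comparing the raw and normalized assemblies (assuming metaquast.py is on the PATH; the reference and contig paths are placeholders):

    # compare both assemblies against the same reference
    metaquast.py -r pedv_reference.fasta -o quast_comparison \
        raw_assembly/contigs.fasta normalized_assembly/contigs.fasta
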
taltman commented 4 years ago

Thanks for jumping in, @McGlock!

We are trying to develop some validation datasets, but they are not ready yet. I'd say that @ababaian can suggest some datasets you can use for now, just for the sake of building something. Ideally you can make a test example using Singularity or Docker; if not, a reproducible script checked into a dev branch in the 'notebook' directory would be good. You can use a dev branch like 'mcglock-dev'.

taltman commented 4 years ago

BTW, don't assemblers like SPAdes and MegaHIT do this kind of kmer normalization?

ababaian commented 4 years ago

The high-coverage PEDV libraries are probably a good toy dataset. They're pretty robust: https://github.com/ababaian/serratus/issues/96#issuecomment-629440696

I know for certain that some datasets suffer from much higher 3' bias than others; I can't recall where I saw it at its worst, though.

Libraries with clear CoV bias are: SRR1194065, SRR1194066, SRR1194067, SRR1194068.
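
For reference, pulling those libraries down for a quick test could look something like this (a sketch assuming sra-tools is installed):

    # fetch each accession and convert it to FASTQ in the working directory
    for acc in SRR1194065 SRR1194066 SRR1194067 SRR1194068; do
        prefetch "$acc" && fasterq-dump --split-files "$acc"
    done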

taltman commented 4 years ago

You can measure runtime using the shell builtin 'time', or /usr/bin/time. For memory usage profiling, you can use valgrind. Informally, you can just monitor the progress using 'top' and noting how much memory & cores it seems to utilize.
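
For example, GNU time's -v flag reports the maximum resident set size directly (a sketch; ProgramName and its arguments are placeholders for whatever tool is being profiled):

    # /usr/bin/time writes its report to stderr
    /usr/bin/time -v ProgramName args 2> time_report.txt
    grep 'Maximum resident set size' time_report.txt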

taltman commented 4 years ago

BTW, please read over the CONTRIBUTING.md file in this repo to understand where experiments like this go, in terms of code & output data.

RyloByte commented 4 years ago

BTW, don't assemblers like SPAdes and MegaHIT do this kind of kmer normalization?

I looked around a little but couldn't verify whether either of them uses digital/kmer normalization. I did find some references to it being tried as a preprocessing step before assembly. Let me know if you have any more info on this, though.

RyloByte commented 4 years ago

You can measure runtime using the shell builtin 'time', or /usr/bin/time. For memory usage profiling, you can use valgrind. Informally, you can just monitor the progress using 'top' and noting how much memory & cores it seems to utilize.

Cool, thanks for this. I know time, but I'll look into valgrind. Otherwise, yeah, I was just going to lurk on top.

rcedgar commented 4 years ago

No lurking required -- you can use the -b option to run in "batch" mode and capture the output for later analysis, something like

while true; do top -b -n1 | grep ProgramName >> top.log; sleep 10; done

Kill the loop when the process completes. Finding the maximum usage of memory or another resource can be done by massaging the file with sed or whatever, followed by sort | head -n1.
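
A sketch of that extraction, using awk instead of sed for brevity and assuming the resident-memory figure sits in the sixth column of top's batch output (the column position can vary between top versions, so check it first):

    # -h sorts human-readable sizes such as 1.5g correctly (GNU sort)
    awk '{print $6}' top.log | sort -rh | head -n1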

taltman commented 4 years ago

Check out the SPAdes manual, sections 3 & 4, regarding read correction by kmer counting:

http://cab.spbu.ru/files/release3.14.1/manual.html#sec3
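
For instance, the read-error-correction stage (BayesHammer) can be run on its own; a sketch assuming SPAdes 3.14 and paired-end FASTQ input (file names are placeholders):

    # run only the k-mer-based read correction, skipping assembly
    spades.py --only-error-correction -1 reads_R1.fq -2 reads_R2.fq -o spades_ec_out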

RyloByte commented 4 years ago

Check out the SPAdes manual, sections 3 & 4, regarding read correction by kmer counting:

http://cab.spbu.ru/files/release3.14.1/manual.html#sec3

I did see those standalone scripts; I just didn't know whether they were baked into the pipeline or just convenience scripts to aid in preprocessing. But yes indeed, those tools should be on the list of things to try for sure!

asl commented 4 years ago

The tools presented in section 4 of the manual are essentially standalone binaries built on some internal SPAdes algorithms. We decided it would be interesting to provide them on their own.

asl commented 4 years ago

In addition to this, I would not suggest doing any "digital normalization". Often these approaches create coverage gaps. This is even more likely for the data in question as the coverage across the genome is very non-uniform.

Here is an example of coverage from a PEDV dataset:

[coverage plot image]

ababaian commented 4 years ago

Good to close for now?

RyloByte commented 4 years ago

Yes, I believe so. After input from Tomer and the SPAdes crew, I don't think this is needed.
