dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/

Filtering out reads with low-abundance k-mers (contaminant reads) #1899

Closed by GA-Goig 4 years ago

GA-Goig commented 4 years ago

Hi everyone!

First of all, excuse me if this is not the proper place to ask this question.

I am trying to use khmer on bacterial whole-genome sequencing data (not metagenomics) to filter out potential contaminating reads from other organisms. I guess there is a way of doing this in khmer, and I have run some tests so far, but I would like to know whether the khmer scripts and parameters I am using are the right ones.

As I understand it, the approach would be to build a k-mer countgraph using load-into-counting.py (I am using default parameters here). Then, with this countgraph, I assume khmer can discard reads that contain low-abundance k-mers. I am doing this with filter-abund.py -C 2 my_countgraph.
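To make the two steps concrete, here is a minimal pure-Python sketch of abundance filtering. It is not khmer's implementation: khmer stores counts in a fixed-memory probabilistic countgraph, whereas this toy uses an exact `collections.Counter`, ignores reverse complements, and uses a tiny `k`. The names `kmers`, `count_kmers`, `min_abundance`, and `keep_read` and the example reads are made up for illustration. Also, going by the khmer documentation, `filter-abund.py` trims reads at low-abundance k-mers rather than discarding them whole, so dropping whole reads here is a simplification.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Exact k-mer counts over all reads (khmer uses a probabilistic table instead)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def min_abundance(read, counts, k):
    """Lowest count among the read's k-mers; 0 if the read is shorter than k."""
    return min((counts[km] for km in kmers(read, k)), default=0)


def keep_read(read, counts, k, cutoff=2):
    """Keep a read only if all of its k-mers were seen at least `cutoff` times."""
    return min_abundance(read, counts, k) >= cutoff


# Hypothetical toy data: three identical reads and one read with a single-base change.
reads = ["ACGTACGGTCA"] * 3 + ["ACGTACCGTCA"]
counts = count_kmers(reads, k=5)
kept = [r for r in reads if keep_read(r, counts, k=5, cutoff=2)]
```

In this toy, the read with the single-base change contains k-mers seen only once, so it falls below the cutoff of 2 and is dropped, while the three identical reads are kept.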

It is just that, after reading the documentation, I am not sure whether I am following the best approach, or even doing what I think I am doing.

Has anyone used khmer for this purpose, or does anyone know how to do it?

Thank you very much in advance,

Galo

ctb commented 4 years ago

Hi Galo,

that looks right! You will probably need to increase the memory size used by load-into-counting.py a bit; I'd suggest 100 MB (so something like load-into-counting.py -M 1e8 should work).
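Why the memory size matters: the countgraph is a fixed-size probabilistic structure, so if it is too small for the number of distinct k-mers, hash collisions inflate counts. The sketch below uses the standard Bloom-filter-style occupancy approximation to estimate a false-positive rate. The function name, the choice of 4 tables, one byte per counter, and the figure of 2e7 distinct k-mers are all assumptions for illustration; they are not values from this thread and may not match khmer's internals exactly.

```python
import math


def estimated_fp_rate(memory_bytes, n_distinct_kmers, n_tables=4):
    """Approximate chance that an unseen k-mer appears to be present.

    Assumes memory is split evenly across n_tables hash tables with one
    byte per counter, and that k-mers hash uniformly.
    """
    table_size = memory_bytes / n_tables
    occupancy = 1.0 - math.exp(-n_distinct_kmers / table_size)
    return occupancy ** n_tables


# Assumed workload: 2e7 distinct k-mers (genomic plus error k-mers).
for mem in (1e7, 1e8, 1e9):
    rate = estimated_fp_rate(mem, 2e7)
    print(f"-M {mem:.0e}: estimated false-positive rate {rate:.3g}")
```

Under these assumptions the estimate drops steeply as memory grows, which is the reason to give the countgraph more room when the data contain many distinct k-mers.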

Note that this will probably remove sequencing errors more than contamination. Happy to chat about other approaches for contamination (note that contamination can often be removed after assembly).
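One way to see why a low cutoff mostly removes errors: in a k-mer abundance spectrum, error k-mers pile up at count 1, while k-mers from a contaminant sequenced at even modest depth sit above a cutoff of 2. The toy below builds such a histogram with exact counts. The sequences, depths, and function names are invented for illustration; on real data, khmer's own `abundance-dist.py` is the script for this.

```python
from collections import Counter


def count_kmers(reads, k):
    """Exact k-mer counts over all reads."""
    counts = Counter()
    for read in reads:
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return counts


def abundance_histogram(counts):
    """Map each abundance value to the number of distinct k-mers with that abundance."""
    return Counter(counts.values())


# Hypothetical toy sample: target at 10x, contaminant at 3x, one erroneous read.
target = "ACGTACGGTCA"
contaminant = "TTGCAAGCTAG"
error_read = "ACGTACCGTCA"  # target with one substituted base
reads = [target] * 10 + [contaminant] * 3 + [error_read]

hist = abundance_histogram(count_kmers(reads, k=5))
for abundance in sorted(hist):
    print(abundance, hist[abundance])
```

Here the contaminant's k-mers all have abundance 3, so a cutoff of 2 would leave them untouched, while the error k-mers at abundance 1 would be removed.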

best, --titus

In reply to the message from Galo Adrián Goig Serrano of Thu, Nov 14, 2019, 04:30:27 -0800 (https://github.com/dib-lab/khmer/issues/1899).

-- C. Titus Brown, ctbrown@ucdavis.edu

GA-Goig commented 4 years ago

Hello and thank you very much for your prompt response!

Actually, I used 100 GB of memory, since we have a server with 256 GB of RAM available.

Currently we remove contamination using Kraken, by discarding reads that are classified as a genus other than the target (e.g. we discard all non-Klebsiella reads from K. pneumoniae sequencing runs). However, this has the flaw of potentially discarding regions that are not in the database, or genetic regions that have been acquired by horizontal gene transfer. Given this issue, it was suggested that we use khmer, since one would expect contaminating reads to have different k-mers from the target organism and to be present in significantly lower proportions.
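The expectation described here, that contaminant reads carry k-mers at much lower abundance than the target, can be sketched as a per-read median k-mer abundance. This is only an illustration with exact counts, and the names, sequences, depths, and threshold are invented. khmer also provides a `count-median.py` script that reports a per-sequence median k-mer abundance from a countgraph; its exact output format should be checked in the documentation.

```python
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """Exact k-mer counts over all reads."""
    counts = Counter()
    for read in reads:
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return counts


def median_abundance(read, counts, k):
    """Median count of the read's k-mers; a rough proxy for the coverage of its source."""
    values = [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
    return median(values) if values else 0


# Hypothetical toy sample: target at 10x, contaminant at 3x.
target = "ACGTACGGTCA"
contaminant = "TTGCAAGCTAG"
reads = [target] * 10 + [contaminant] * 3
counts = count_kmers(reads, k=5)

# Flag reads whose median abundance is well below the target's (threshold is arbitrary).
threshold = 5
suspect = [r for r in set(reads) if median_abundance(r, counts, k=5) < threshold]
```

A split like this only works when the contaminant really is at much lower depth than the target; a contaminant at similar depth, or target regions with unusual copy number, would not separate cleanly.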

Do you think khmer would be a good/better approach for our purposes?

I would really appreciate your opinion on this.

Best, Galo

ctb commented 4 years ago

Hi Galo,

agree with your downsides to kraken :).

Suggest looking into blobtools.

--t

In reply to the message from Galo Adrián Goig Serrano of Thu, Nov 14, 2019, 09:00:45 -0800.

GA-Goig commented 4 years ago

Thank you very much! I'll give it a look =)