Scrubby updates - GitHub issues

esteinig commented 7 months ago

@gokeson Ammar mentioned you were integrating Scrubby with the pipeline! Really cool, it was mostly a small side project thing, but people seem to be using it here and there, so will do my best to upgrade it accordingly over the next two weeks or so.

Is there anything specific you were keen to see besides easy deployment via BioConda and/or binaries? We can keep a checklist here, including if you'd like to add anything relevant for your lab as well.

Scrubby wishlist:

[ ] Distribution via BioConda or at least private channel
[ ] HPRG reference genome database for depletion
[ ] Reference database downloader with pre-built indices

esteinig commented 7 months ago

Michael and Lachlan are publishing an updated version of their human pangenome database for depletion assessment (https://github.com/mbhall88/classification_benchmark, see preprint linked there). I am assessing it for clinical metagenomics data at the moment.

Michael's benchmark shows that besides the obvious performance of long reads, (simulated) Illumina reads are depleted with high sensitivity and specificity with Kraken2 and the pangenome DB, and that minimap2 is a great follow-up from that with the alignment, essentially the process that Scrubby follows to speed things up in our high-depth clinical samples. I need to assess this under more realistic conditions for low abundance pathogens, but that may not be relevant to you.

So... all this to say it looks like the approach has decent performance at least for getting rid of human reads in these simulated conditions ^^
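As a rough illustration of the two-stage idea described above (a fast k-mer classification pass, then an alignment pass on whatever is left), here is a minimal Python sketch. It works on read IDs only; the function name and the example IDs are made up for illustration, and it is not Scrubby's actual implementation.

```python
from typing import Iterable, List, Set


def deplete_two_stage(
    read_ids: Iterable[str],
    classified_host: Set[str],
    aligned_host: Set[str],
) -> List[str]:
    """Return read IDs that survive both depletion stages, in input order.

    classified_host: IDs a k-mer classifier (e.g. Kraken2) assigned to the host.
    aligned_host: IDs an aligner (e.g. minimap2) mapped to the host reference.
    """
    # Stage 1: drop reads the fast classifier already flags as host.
    after_classifier = [r for r in read_ids if r not in classified_host]
    # Stage 2: only the smaller remainder needs the slower alignment check.
    return [r for r in after_classifier if r not in aligned_host]


# Hypothetical read IDs, purely for demonstration.
reads = ["r1", "r2", "r3", "r4", "r5"]
kept = deplete_two_stage(reads, classified_host={"r1", "r2"}, aligned_host={"r4"})
print(kept)  # ['r3', 'r5']
```

The point of the ordering is that the cheaper classifier removes the bulk of host reads first, so the aligner only sees the remainder.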

gokeson commented 7 months ago

@esteinig Thank you very much for developing Scrubby. I can tell you that it has greatly improved this pipeline. I was using bbsplit and kneaddata in the past for hg depletion. Neither of the two offers the efficiency that scrubby offers. The opportunity to extract at a set taxa level has made my life a lot easier and assembly faster especially with clinical samples (less so with isolates, unless there's heavy contamination).

Scrubby wishlist:

[ ] Distribution via BioConda or at least private channel in the meantime

This is going to make CtGAP easier to use for BioConda fans, so I appreciate your help with this. Our aim is to get this pipeline ready for publishing around June (still waiting on one more ref genome to be sequenced and included). So I guess we will have plenty of time to test this pipeline with all Scrubby updates.

[ ] HPRG reference genome database for depletion (see next comment)

Thank you so much for including this in the next version. It will be mighty useful when we start our comparative analyses of global Chlamydia trachomatis genomes from clinical samples (mostly metagenomes) later in the year.

ammaraziz commented 7 months ago

HPRG reference genome database for depletion (see next comment)

Will you distribute the HPRG with scrubby or include a subcommand for downloading/preprocessing? That'd make it much easier for end users. But at the same time it means supporting the downloading, which can be a huge pain - see the kraken2 repo issues, which are filled with reports of problems downloading and building the reference. Another option is to have some docs for the end user that specify best practices, e.g. download this reference, run this minimap2 command, then point scrubby to it.

We could help there if you want!
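For the "run this minimap2 command" step in such docs, a small sketch of what the index-building command could look like, assembled in Python. The file name `human_ref.fa` and the helper name are placeholders; `-x sr` (short-read preset) and `-d` (write index to file) are standard minimap2 options, but which preset suits a given dataset is an assumption to check against the minimap2 manual.

```python
import shlex
from pathlib import Path
from typing import List


def minimap2_index_command(reference: Path, preset: str = "sr") -> List[str]:
    """Build the argument list for creating a minimap2 index (.mmi) file.

    The preset is applied at indexing time because k-mer and window sizes
    are fixed once the index is written.
    """
    index = reference.with_suffix(".mmi")
    return ["minimap2", "-x", preset, "-d", str(index), str(reference)]


# Placeholder reference file name, not a real download.
cmd = minimap2_index_command(Path("human_ref.fa"))
print(shlex.join(cmd))  # minimap2 -x sr -d human_ref.mmi human_ref.fa
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once minimap2 is installed and the reference has been downloaded.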

esteinig commented 7 months ago

@gokeson Thanks for letting me know! Great to hear it's been useful. There are a lot of things that can be improved in terms of runtime and resource management. Without going too deep into the weeds, we found that for our deep short read data the depletion step can still be quite slow when there is overwhelming host material.

I'd be very curious (if you don't mind communicating publicly about this, otherwise always happy to change to email) - are you trying to retrieve whole genomes or are these very low abundance sample types you are sequencing, are you using short or long reads?

We have been building a technically fairly complex clinical diagnostic stack (including interface for interpretation and reporting, host genome analysis if you have consent and a few other nifty things) - it is not quite ready for people to use yet, but it's been used in production on challenging samples with expected low abundance of pathogenic agents from neurological conditions (strongly depending on wet-lab protocols in our experience). Happy to share if that might be useful for you as well - but given the recent push for this across public health labs, you probably have your own system going :)

@ammaraziz yes absolutely! If you remember vaguely from last year, there is something in the works for Cerebro (which includes host indices). I think it's a good suggestion in the interim and a simple downloader with a list of links is probably not too onerous to maintain (the indices are thankfully not as large as taxonomic databases)
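A minimal sketch of what "a simple downloader with a list of links" might look like, using only the Python standard library. The manifest entry, URL and function names are hypothetical placeholders, not real Scrubby or Cerebro resources; a real version would probably also record expected checksums next to each link.

```python
import hashlib
import urllib.request
from pathlib import Path

# Hypothetical manifest: the name and URL are placeholders for illustration.
INDEX_LINKS = {
    "example-host-index": "https://example.org/indices/example-host-index.mmi",
}


def target_path(url: str, outdir: Path) -> Path:
    """Local path for a URL: the output directory plus the URL's file name."""
    return outdir / url.rsplit("/", 1)[-1]


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large indices are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_index(name: str, outdir: Path) -> Path:
    """Fetch one named index unless a file with that name already exists locally."""
    url = INDEX_LINKS[name]
    dest = target_path(url, outdir)
    if not dest.exists():
        outdir.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    return dest
```

Keeping the links in one plain dictionary means that updating an index is a one-line change, and `sha256sum` gives users a way to confirm a download completed intact.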

ammaraziz commented 7 months ago

@esteinig I have to confess that @gokeson knows about Cerebro. I spilled the beans about it last year when chatting with him. He was very keen to test it but I didn't mention anything because you weren't ready to share and it was undergoing the big change at the time. I could run him through the installation and usage for Cerebro if I have your blessing. If I remember correctly the metagenomic project is related to this pipeline but not exactly the same.

I think it's worth discussing this outside of this repo but I'm not opposed to continuing the discussion here. We could have a Zoom meeting to discuss Cerebro, and actually I wanted to pick your brain on the best approach for Chlamydia assembly; there are a few oddities we could use help with.

P.S Sola (Gokeson) is in QLD so our timezones are very close.

esteinig commented 7 months ago

Lmao no drama man! It's still not properly validated with clinical data and it's a bit of a construction site. I am a little hesitant to let people try and use it - it's absolutely gonna break for someone else and the database thing is a pain point ^^ I'm more than happy to share when it's usable of course, will let you know ASAP.

It's also very very much focused on low abundance sample types and short reads (at the moment) simply because we don't have many other datasets for diagnostics, and doing something for the scope of :sparkles: metagenomics :sparkles: i.e. complex natural communities with diverse stuff hanging out, is not in scope for Cerebro. There's probably a better MAG-related pipeline from the ACE people at UQ.

Yeah agree, we can catch up on Zoom sometime on this! :)

gokeson commented 6 months ago

Hi Eike,

Apologies for the very late reply. I got busy away from work these past many days. Happy to be back at my desk again.

are you trying to retrieve whole genomes or are these very low abundance sample types you are sequencing, are you using short or long reads?

Both! We do some in-house QC to ensure we recover as close as possible to a whole genome for our clinical samples. We also have a separate project focusing mostly on microbiome. In the latter project, we don't do much in-house QC but also try to recover the whole genome sequences of C. trachomatis where possible (fails many times but is always worth trying).

Happy to share if that might be useful for you as well - but given the recent push for this across public health labs, you probably have your own system going :)

Sounds exciting. Ammar has mentioned it in the past and I am very much looking forward to it. We do not have such a system yet, so I am super keen to give this a try.

And we should meet up on Zoom soon. Ammar speaks highly of you. Keen to put a face to the name.

-- Regards, Shola

ammaraziz / ctgap

Scrubby updates #9