Finding Contaminants and Removing them

ajkarloss commented 5 years ago

Add an option to the sequence quality check to screen for possible contaminants. Use mash to predict the contaminants in the raw sequences.
-- Prepare/download the contaminant database from NCBI
-- Prokaryotes database: will need to be updated regularly
-- Make a summary with Håkon's script. NB: as such it is not OK for metagenomics; this can be refined

Problem: we need to remove phiX, maybe by trimming -> ask Thomas for advice on this issue

evezeyl commented 5 years ago

As Karin said: we might need some advice on the best way of creating the database: the default database contains all sequences...

Do we have a way to clean the database?
-- clean entry names (maybe better to modify the name filter in the R script)?
-- complete genomes or not? all eukaryote sequences?
-- the database will need to be updated regularly

Frequency of updates? Can we automate as much as possible? Is it, e.g., possible to schedule updating/creating the database with specified parameters?
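One piece of such automation could be a check that decides whether the database is due for a rebuild. This is only a minimal Python sketch: the function name `needs_update` and the 30-day default are assumptions for illustration, not something agreed on in this thread.

```python
from datetime import datetime, timedelta


def needs_update(last_built: datetime, now: datetime, max_age_days: int = 30) -> bool:
    """Return True when the database is older than the allowed age.

    max_age_days is a hypothetical default; the real update frequency
    is still an open question.
    """
    return now - last_built > timedelta(days=max_age_days)


# Fixed dates so the example is reproducible.
built = datetime(2020, 1, 1)
recent_check = needs_update(built, datetime(2020, 1, 15))  # 14 days old
late_check = needs_update(built, datetime(2020, 3, 1))     # 60 days old
```

A scheduled job (cron or similar) could call a check like this and trigger the database creation step only when it returns True.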

Karin does not want any modification of the files here -> maybe remove phiX and trim adapters, but do not output files -> send them directly into the channel. Would that be a good enough solution? Not removing phiX and adapters should affect mash screen ...

We need slight modifications to Håkon's script: https://github.com/hkaspersen/misc-scripts/blob/master/scripts/mash_screen.R

On the organism of interest: in Håkon's script we filter out the organism of interest based on its name (e.g. "Listeria monocytogenes"), but it was not filtered and popped up as a likely contaminant because of a dot inserted in the name in the mash database -> so we might need to improve the filter.
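One possible improvement is to normalize both the organism name and the database entry before comparing them, so that stray punctuation no longer breaks the match. The sketch below is in Python rather than R and is not Håkon's implementation; the function names and the example entries are hypothetical, and real mash database comments may look different.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase and collapse every run of non-alphanumeric characters into one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def is_organism_of_interest(entry_comment: str, organism: str) -> bool:
    """True if the normalized organism name occurs as whole words in the entry."""
    target = normalize_name(organism)
    padded = f" {normalize_name(entry_comment)} "
    return f" {target} " in padded


# Hypothetical entry strings, for illustration only.
examples = [
    "Listeria monocytogenes strain X",
    "Listeria. monocytogenes strain X",
    "Escherichia coli K-12",
]
flags = [is_organism_of_interest(e, "Listeria monocytogenes") for e in examples]
```

The same idea should carry over to R by stripping punctuation from both strings before the pattern match.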

Line 74: needs to be modified for pattern matching, according to the Nextflow script

Maybe add an option to transpose the output tables (a question of preference: I prefer them transposed; easy to modify)
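For reference, transposing a summary table is a small operation. A minimal Python sketch, assuming the table is held as a list of equal-length rows; the sample names and values are made up:

```python
def transpose(rows):
    """Swap rows and columns of a rectangular table (list of equal-length lists)."""
    return [list(column) for column in zip(*rows)]


# Made-up summary table: one row per sample.
table = [
    ["sample", "top_hit", "identity"],
    ["S1", "Listeria monocytogenes", "0.99"],
    ["S2", "Escherichia coli", "0.97"],
]
transposed = transpose(table)
```

After transposing, each sample becomes a column and each measured field becomes a row.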

A short explanation of what the filter is/does, to help with selecting options -> on Bifrost/Håkon (towards 0 we also get rare reads matching; towards 1, high values filter out all of the low-abundance sequences and we only get the ones that dominate the files)
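To make the effect of the threshold concrete, here is a minimal Python sketch that assumes the filter is a minimum on the identity column of the `mash screen` output (tab-separated: identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment). Whether this is exactly how the filter in the R script works is an assumption, and the rows below are made-up values for illustration.

```python
from typing import List, NamedTuple


class ScreenHit(NamedTuple):
    identity: float
    shared_hashes: str
    median_multiplicity: int
    p_value: float
    query_id: str
    query_comment: str


def parse_screen_line(line: str) -> ScreenHit:
    """Parse one tab-separated line of mash screen output."""
    f = line.rstrip("\n").split("\t")
    comment = f[5] if len(f) > 5 else ""
    return ScreenHit(float(f[0]), f[1], int(f[2]), float(f[3]), f[4], comment)


def filter_hits(lines: List[str], min_identity: float) -> List[ScreenHit]:
    """Keep only the hits whose identity is at least min_identity."""
    hits = [parse_screen_line(line) for line in lines if line.strip()]
    return [h for h in hits if h.identity >= min_identity]


# Made-up rows: one dominant organism, one moderate hit, one rare hit.
rows = [
    "0.998\t980/1000\t25\t0\tref1.fna\tListeria monocytogenes",
    "0.930\t220/1000\t2\t1e-50\tref2.fna\tEscherichia coli",
    "0.850\t35/1000\t1\t1e-10\tref3.fna\tBacillus subtilis",
]
strict = [h.query_comment for h in filter_hits(rows, 0.95)]
loose = [h.query_comment for h in filter_hits(rows, 0.0)]
```

With the strict threshold only the dominant organism remains, while a threshold near 0 also keeps the rare matches.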

We might require some packages to be installed for R and Bifrost/conda? (i.e. I had to install the cairo library on my Ubuntu system to be able to use the script, plus the additional svglite package in R, but maybe they are already in the R system)

Thomieh73 commented 4 years ago

I think this paper was really helpful when it comes to human contamination.

Interestingly, in a discussion with Jen Lu, who worked on Kraken2, I heard that when she classifies metagenomic reads, she includes an unmasked human genome in order to catch anything that looks human.

NorwegianVeterinaryInstitute / Bifrost

Finding Contaminants and Removing them #28