Open ajkarloss opened 5 years ago
As Karin said, we might need some advice on the best way of creating the database: the default database contains all sequences...
- do we have a way to clean the database: clean entry names (maybe better to modify the R script's name filter)? -- complete genomes or not? -- all eukaryote sequences?
-- the database will need to be updated regularly: how frequent should the updates be? Can we automate as much as possible? Is it e.g. possible to schedule updating/creating the database with specified parameters?
Karin does not want any modification of the files here -> maybe remove phiX and adaptors and trim, but do not output files -> send them directly into the channel. Would that be a good enough solution? Not removing phiX and adaptors should affect mash screen ...
We need slight modifications to Håkon's script:
https://github.com/hkaspersen/misc-scripts/blob/master/scripts/mash_screen.R
- on the organism of interest (i.e. in Håkon's script we filter the organism of interest based on name, e.g. "Listeria monocytogenes", but an entry was not filtered and popped up as a likely contaminant because of a dot inserted in the name in the mash database) -> so we might need to improve the filter.
- line 74: needs to be modified for pattern matching, according to the nextflow script
- maybe add an option to transpose the output tables (question of preference - I prefer it transposed - easy to modify)
- short explanation of what the filter is/does, to help with selecting options -> on bifrost/Håkon (towards 0 we also get rare reads matching; towards 1, high values filter out all of the low-abundance sequences and we only get the ones that dominate the files)
- we might require some packages to be installed for R and Bifrost/conda? (i.e. I had to install the cairo library on my Ubuntu system to be able to use the script, plus the additional svglite package in R, but maybe it is already in the R system)
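One possible way to make the name filter tolerant of the dot problem described above is to normalize both the organism name and the database entry before matching. This is only a minimal Python sketch of the idea, not Håkon's R code: the function names are hypothetical, and the dotted example names are made up, since the exact entry that slipped through is not shown in this issue.

```python
import re


def normalize_name(name: str) -> str:
    """Lower-case a name and collapse dots, underscores and whitespace into single spaces."""
    return re.sub(r"[\s._]+", " ", name).strip().lower()


def is_organism_of_interest(entry: str, organism: str) -> bool:
    """Return True if the normalized organism name occurs in the normalized entry."""
    return normalize_name(organism) in normalize_name(entry)


if __name__ == "__main__":
    # Hypothetical database entry names, for illustration only.
    entries = [
        "Listeria monocytogenes EGD-e",
        "Listeria.monocytogenes strain X",
        "Escherichia coli K-12",
    ]
    for entry in entries:
        print(entry, is_organism_of_interest(entry, "Listeria monocytogenes"))
```

The same normalization could probably be done in the R script with a regular expression replacement before the pattern match; the point is only that both sides of the comparison go through the same cleaning step.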
Add an option in the quality check of sequences to screen for possible contaminants. Use mash to predict the contaminants in the raw sequences -- Prepare/download the contaminant database from NCBI -- Prokaryotes database - will need to be updated regularly
-- Make a summary with Håkon's script - NB: as such not OK for metagenomics - can be specified
Problem: we need to remove phiX - maybe trimming -> ask Thomas for advice on this issue
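To make the summary step more concrete, here is a rough Python sketch of reading `mash screen` output and applying a minimum threshold. It relies on the standard tab-separated `mash screen` columns (identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment). Applying the threshold to the identity column is an assumption on my side, as are the function names and the example rows; Håkon's script may filter differently.

```python
import io

# Column order of the tab-separated output written by `mash screen`.
MASH_SCREEN_COLUMNS = [
    "identity",
    "shared_hashes",
    "median_multiplicity",
    "p_value",
    "query_id",
    "query_comment",
]


def read_mash_screen(handle):
    """Parse mash screen output lines into a list of dicts (identity as float)."""
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip empty or malformed lines
        fields += [""] * (len(MASH_SCREEN_COLUMNS) - len(fields))
        row = dict(zip(MASH_SCREEN_COLUMNS, fields))
        row["identity"] = float(row["identity"])
        rows.append(row)
    return rows


def filter_hits(rows, min_identity):
    """Keep hits at or above min_identity (assumed filter column), best first."""
    kept = [row for row in rows if row["identity"] >= min_identity]
    return sorted(kept, key=lambda row: row["identity"], reverse=True)


if __name__ == "__main__":
    # Made-up rows in mash screen format, for illustration only.
    example = (
        "0.99\t950/1000\t12\t0\tref1.fna\tListeria monocytogenes\n"
        "0.85\t40/1000\t1\t1e-20\tref2.fna\tEscherichia coli\n"
    )
    for hit in filter_hits(read_mash_screen(io.StringIO(example)), 0.9):
        print(hit["identity"], hit["query_comment"])
```

With a threshold near 0 every hit is kept, including weak matches; raising it towards 1 leaves only the strongest matches, which is the behaviour the short explanation of the filter should describe for users.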