OLC-Bioinformatics / ConFindr

Intra-species bacterial contamination detection
https://olc-bioinformatics.github.io/ConFindr/
MIT License
22 stars 8 forks

Request details on programmatic database setup for confindr #33

Closed bala-ruokavirasto closed 10 months ago

bala-ruokavirasto commented 2 years ago

Hi,

I used the following command:

confindr_database_setup -s key_secret.txt -o confindr_database/

And obtained the database for only three species, as shown below:

confindr_database$ ls
Escherichia_db_cgderived.fasta Salmonella_db_cgderived.fasta gene_allele.txt rMLST_combined.fasta Listeria_db_cgderived.fasta download_date.txt profiles.txt refseq.msh

However, I need the db_cgderived.fasta files for the Yersinia and Campylobacter genera as well!

May I know how to obtain those programmatically?

Best Regards, Bala

adamkoziol commented 2 years ago

Hi Bala,

Since you have the rMLST database, you don't need the CGE-derived files. Just run ConFindr in rMLST mode (use the --rmlst flag), and any bacterial genus should be able to be processed.

A

bala-ruokavirasto commented 2 years ago

Hi,

Thanks for the reply!

I tested as you mentioned and got the results below:

$ cat ecoli_test/results/confindr/confindr_report.csv
Sample,Genus,NumContamSNVs,ContamStatus,PercentContam,PercentContamStandardDeviation,BasesExamined,DatabaseDownloadDate
FIAR-847_S5_1_trim,Escherichia,0,False,0,0,38310,ND
FIAR-847_S5_2_trim,Escherichia,0,False,0,0,38310,ND

$ cat salmonella_test/results/confindr/confindr_report.csv
Sample,Genus,NumContamSNVs,ContamStatus,PercentContam,PercentContamStandardDeviation,BasesExamined,DatabaseDownloadDate
FIAR-844_S2_L001_1_trim,Salmonella,0,False,0,0,61956,ND
FIAR-844_S2_L001_2_trim,Salmonella,1,False,0,0,61956,ND

$ cat listeria_test/results/confindr/confindr_report.csv
Sample,Genus,NumContamSNVs,ContamStatus,PercentContam,PercentContamStandardDeviation,BasesExamined,DatabaseDownloadDate
FIXT-208_S17_L001_1_trim,Listeria,0,False,0,0,28425,ND
FIXT-208_S17_L001_2_trim,Listeria,0,False,0,0,28425,ND

BasesExamined values are reported for the above three species. However, that information is missing for the following two species:

$ cat campy_test/results/confindr/confindr_report.csv
Sample,Genus,NumContamSNVs,ContamStatus,PercentContam,PercentContamStandardDeviation,BasesExamined,DatabaseDownloadDate
131469S9L001_1_trim,Campylobacter,0,False,ND,ND,0,ND
131469S9L001_2_trim,Campylobacter,0,False,ND,ND,0,ND

$ cat yersinia_test/results/confindr/confindr_report.csv
Sample,Genus,NumContamSNVs,ContamStatus,PercentContam,PercentContamStandardDeviation,BasesExamined,DatabaseDownloadDate
FIXT-266_S6_L001_1_trim,Yersinia,0,False,ND,ND,0,ND
FIXT-266_S6_L001_2_trim,Yersinia,0,False,ND,ND,0,ND

Could you clarify why BasesExamined is zero for the above two species but has a value for E. coli, Salmonella and Listeria? It would also be nice to know how the BasesExamined values are produced by the ConFindr tool.

Best Regards, Bala


adamkoziol commented 2 years ago

Based on the fact that the Escherichia samples had 38310 bases as the bases examined, it looks like you're still not using the --rmlst mode. Could you please include the command line call to ConFindr you used?
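For anyone checking several reports at once, here is a minimal Python sketch that reads a confindr_report.csv and lists the samples whose BasesExamined is 0. The column names are taken from the reports pasted above; the function names and the inline example text are only illustrative and are not part of ConFindr.

```python
import csv
import io


def summarize_report(handle):
    """Return (sample, genus, bases_examined) for each row of a report."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append((row["Sample"], row["Genus"], int(row["BasesExamined"])))
    return rows


def samples_without_bases(handle):
    """Return the names of samples for which no bases were examined."""
    return [sample for sample, _genus, bases in summarize_report(handle) if bases == 0]


# Two rows copied from the reports in this thread, used as inline test data.
report_text = (
    "Sample,Genus,NumContamSNVs,ContamStatus,PercentContam,"
    "PercentContamStandardDeviation,BasesExamined,DatabaseDownloadDate\n"
    "FIAR-847_S5_1_trim,Escherichia,0,False,0,0,38310,ND\n"
    "131469S9L001_1_trim,Campylobacter,0,False,ND,ND,0,ND\n"
)

print(samples_without_bases(io.StringIO(report_text)))
```

With a real report, open the file with `open(path, newline="")` and pass the handle instead of the `io.StringIO` object.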

The BasesExamined value is the total number of bases present in the sequence files containing the alleles returned by the KMA screen (this can be printed to the screen using the --verbosity debug argument). You can inspect this sequence file if you use the -k argument to keep the files. It is named sample_name_alleles.fasta, e.g. FIAR-847_S5_1_trim_alleles.fasta.
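As a rough cross-check of that number, the bases in a kept alleles file can be summed with a few lines of Python. This is only a sketch: it assumes the count is a plain sum of sequence lengths in the FASTA file, which may not match ConFindr's internal calculation exactly, and the function name and inline example sequences are made up for illustration.

```python
import io


def count_fasta_bases(handle):
    """Sum the lengths of all sequence lines in a FASTA file, skipping headers."""
    total = 0
    for line in handle:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


# Tiny made-up FASTA: 6 + 2 bases for the first record, 3 for the second.
example = io.StringIO(">b0436_1\nACGTAC\nGT\n>b0437_2\nACG\n")
print(count_fasta_bases(example))  # 11
```

For a real run, pass an open handle to the kept file, e.g. `open("FIAR-847_S5_1_trim_alleles.fasta")`, and compare the result with the BasesExamined column.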

If you are using CGE-derived databases, the alleles in the FASTA file should have names like b0436_1, while if you are using the rMLST database, the alleles should have names like BACT000001_10671.
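The naming difference can be checked automatically. The sketch below guesses which database produced an alleles file by looking at the FASTA header names. The regular expression is inferred from the single rMLST example above (BACT000001_10671) and is an assumption, as are the function name and the returned labels; anything that does not match is reported only as "not rMLST", since the CGE-derived naming is not uniform enough to test from one example.

```python
import re

# Pattern inferred from the example name BACT000001_10671 (an assumption).
RMLST_NAME = re.compile(r"^BACT\d+_\d+$")


def guess_database(fasta_lines):
    """Guess the source database from the allele names in FASTA header lines."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    if not names:
        return "unknown"
    rmlst_count = sum(1 for name in names if RMLST_NAME.match(name))
    if rmlst_count == len(names):
        return "rMLST"
    if rmlst_count == 0:
        return "not rMLST (possibly CGE-derived)"
    return "mixed"


print(guess_database([">BACT000001_10671", "ACGT"]))
print(guess_database([">b0436_1", "ACGT"]))
```

The function accepts any iterable of lines, so an open file handle for sample_name_alleles.fasta can be passed directly.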

A

pcrxn commented 1 year ago

I'll close this issue in 30 days if there are no further updates!

pcrxn commented 10 months ago

Closed due to stale issue