Problems with the decontamination of metagenomic data or viral data

benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data

https://benjjneb.github.io/decontam/

Problems with the decontamination of metagenomic data or viral data #136

Open 1023011930 opened 1 year ago

1023011930 commented 1 year ago

Dear Sir or Madam,

I appreciate your significant contribution to decontaminating sequencing data. I haven't used decontam yet, but I think most of the examples are based on ASV/OTU relative abundance tables. For metagenomic data, as I understand it, contaminants can be removed by constructing MAG abundance tables. But viral sequences are difficult to bin into MAGs, and they are often left as individual contigs. Can decontam decontaminate based on contig abundance, and is this scientifically sound? From what I've read, the article https://www.nature.com/articles/s41586-020-2192-1 uses Kraken classification data as the basis for decontamination. Is this a good approach?

Thank you very much for any guidance!

1023011930 commented 1 year ago

To summarize, my problem is that some metagenomic data, such as viral data, are difficult to bin into "metaOTUs" (so they cannot produce the kind of ASV relative abundance table that 16S data normally gives). How should I process these metagenomic data so that they can be used as input for decontam?

benjjneb commented 1 year ago

You can use decontam with any feature type that has a relative abundance in each sample. This includes contigs.

1023011930 commented 1 year ago

Does this mean that I can use software such as BWA or Bowtie2 to map reads to the contigs, calculate the RPM or TPM of each contig, and then use those values as the relative abundance for decontam? Thank you for your kindness!
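As a rough illustration of the quantification step asked about here, the sketch below computes TPM for one sample from per-contig read counts and contig lengths. The read-mapping step itself (BWA or Bowtie2) is not shown; the function name `tpm`, the contig names, and all numbers are hypothetical and only serve the example.

```python
def tpm(read_counts, contig_lengths):
    """Return TPM per contig for a single sample.

    read_counts:    dict mapping contig name -> number of mapped reads
    contig_lengths: dict mapping contig name -> contig length in bp
    """
    # Reads per kilobase: corrects for longer contigs recruiting more reads.
    rpk = {c: read_counts[c] / (contig_lengths[c] / 1000) for c in read_counts}
    total = sum(rpk.values())
    if total == 0:
        # No mapped reads in this sample; avoid dividing by zero.
        return {c: 0.0 for c in rpk}
    # Rescale so that the values of one sample sum to one million.
    return {c: value / total * 1_000_000 for c, value in rpk.items()}


# Hypothetical counts and lengths for one sample (not real data).
counts = {"contig_1": 200, "contig_2": 100, "contig_3": 50}
lengths = {"contig_1": 2000, "contig_2": 1000, "contig_3": 5000}

sample_tpm = tpm(counts, lengths)
for contig, value in sample_tpm.items():
    print(contig, round(value, 2))
```

Repeating this per sample would give a samples-by-contigs table, which is the general shape of input the question is about. Whether TPM or RPM is the better choice for a given dataset is left open here.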

1023011930 commented 1 year ago

As I understand it, a standardized contig abundance table is not the same as a relative abundance table like the one from 16S data.

In a 16S abundance table, each sample sums to 100% and each OTU takes up a portion of that 100%; the percentage (0-100%) it takes up is the data.

Standardized contig abundance tables, by contrast, generally use the TPM (or read count) of each contig as the data, so their values are not necessarily in the 0-100 range.

These two abundance tables do not seem the same to me. How is a contig abundance table generally handled? If there is a relevant tutorial or paper, I would be very grateful. Thank you for your answer.

benjjneb commented 1 year ago

TPM is also a relative abundance. You can use it just the same.

A "relative abundance" measure is any metric that informs about the abundance of this relative to that. If the TPM of contig X is double that of contig Y, then X has double the relative abundance (by this measure) of contig Y.
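A minimal sketch of this point, using made-up TPM values: dividing each value by the sample total turns TPM into proportions that sum to 1 (like a 16S table), and the ratio between any two contigs is unchanged. The helper name `to_proportions` and the contig names are assumptions for illustration only.

```python
def to_proportions(abundances):
    """Rescale one sample's abundances so that they sum to 1."""
    total = sum(abundances.values())
    if total == 0:
        raise ValueError("sample has zero total abundance")
    return {name: value / total for name, value in abundances.items()}


# Hypothetical TPM values for one sample (they sum to one million).
tpm_values = {"contig_X": 600_000.0, "contig_Y": 300_000.0, "contig_Z": 100_000.0}

props = to_proportions(tpm_values)

# The X-to-Y ratio is the same on both scales.
ratio_tpm = tpm_values["contig_X"] / tpm_values["contig_Y"]
ratio_prop = props["contig_X"] / props["contig_Y"]
print(ratio_tpm, ratio_prop)
```

So the 0-100% scale and the TPM scale differ only by a constant factor within a sample, which is why either can serve as a relative abundance.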

1023011930 commented 1 year ago

I think I see what you mean. I will try using standardized contig relative abundance. Thank you very much for your reply!