ChiLiubio / microeco

An R package for data analysis in microbial community ecology
GNU General Public License v3.0

Category 'unidentified' when analyzing the relative abundance of species #300

Closed by bfalco 1 month ago

bfalco commented 6 months ago

Hi Chi,

I have noticed that when analyzing the relative abundance of species, the category "unidentified" always appears, even after removing species without a taxonomic assignment (s__). Why does this category appear? Does it correspond to one or several unidentified taxa?

Thank you very much, Bruno

ChiLiubio commented 6 months ago

Hi Bruno, could you please attach an example that I can reproduce, so that I can check how the category is generated? It would be greatly appreciated if you could use the example data in the package.

Best, Chi

bfalco commented 6 months ago

Hi Chi,

I've observed an error in how microeco detects certain taxa. In my data, for example, the species Faecalibacterium prausnitzii is found in Otu1. The package did not detect it when I entered it with a space (s__Faecalibacterium prausnitzii), but it did after I replaced the space with an underscore (s__Faecalibacterium_prausnitzii). This happened with several taxa, and I don't know the reason.

You can check it by loading my Data.zip with and without an underscore using view(dataset$tax_table).

Best regards, Bruno

ChiLiubio commented 6 months ago

Hi. I do not know how the "Data without underscore" file was generated. I see that there is no species information for Otu1.

bfalco commented 6 months ago

My original file has taxonomic information for all OTUs, that is, no empty s__ entries, but microeco doesn't detect several species until I replace the space within the species name with an underscore (_), as in the case of Faecalibacterium prausnitzii in Otu1.

It would be great if you could check this with data that is more familiar to you, to find out whether it's a general error or whether it is caused by my R version (4.1.2).

Thank you very much for your work.

ChiLiubio commented 6 months ago

Hi Bruno, thanks. Could you please send me your original file so that I can carefully check how this happens? I think it is better to send the file by email (liuchi0426@126.com).

Best, Chi

ChiLiubio commented 6 months ago

Please also attach the code that I can use to reproduce your result. Thanks.

ChiLiubio commented 6 months ago

Hi Bruno, I found that the issue comes from the function tidy_taxonomy. Its pattern parameter contains a set of regular expressions used to delete uninformative taxonomic information. The default pattern parameter is c(".*Unassigned.*", ".*uncultur.*", ".*unknown.*", ".*unidentif.*", ".*unclassified.*", ".*No blast hit.*", ".*sp\\.$", ".*metagenome.*", ".*cultivar.*", ".*archaeon$", "__synthetic.*", ".*\\sbacterium$", ".*bacterium\\s.*", ".*Incertae.sedis.*"). One pattern, ".*bacterium\\s.*", unexpectedly matches "Faecalibacterium prausnitzii", because the name contains "bacterium" followed by a space. I will fix this. As a temporary workaround, please adjust the parameter like this:

tax %<>% tidy_taxonomy(pattern = c(".*Unassigned.*", ".*uncultur.*", ".*unknown.*", ".*unidentif.*", ".*unclassified.*", ".*No blast hit.*", ".*sp\\.$", ".*metagenome.*", ".*cultivar.*", ".*archaeon$", "__synthetic.*", ".*\\sbacterium$", ".*\\sbacterium\\s.*", ".*Incertae.sedis.*"))
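
The difference between the two patterns can be checked without the package. The following is a minimal sketch using only base R's grepl; it does not call tidy_taxonomy and does not claim to reproduce its internals. The third taxon label is a hypothetical example added for illustration and is not taken from the data in this thread.

```r
# Base R only; microeco is not loaded or called here.
old_pattern <- ".*bacterium\\s.*"     # entry in the default pattern vector quoted above
new_pattern <- ".*\\sbacterium\\s.*"  # replacement entry used in the workaround

taxa <- c(
  "s__Faecalibacterium prausnitzii",  # genus name ends in "bacterium", then a space
  "s__Faecalibacterium_prausnitzii",  # underscore instead of a space
  "s__uncultured bacterium clone"     # hypothetical label, for illustration only
)

# grepl returns TRUE where the regular expression matches the string.
old_hits <- grepl(old_pattern, taxa)
new_hits <- grepl(new_pattern, taxa)

# old_hits: TRUE FALSE TRUE  (the valid species name with a space is matched)
# new_hits: FALSE FALSE TRUE (only the standalone word "bacterium" is matched)
print(data.frame(taxa, old_hits, new_hits))
```

Requiring whitespace before "bacterium" is what stops genus names that merely end in "bacterium" from matching, while a standalone word "bacterium" followed by a space is still caught.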

Thanks very much.

Best, Chi