allind / EukDetect

MIT License

unable to install eukdetect #23

Open saras224 opened 2 years ago

saras224 commented 2 years ago

Hey @allind, I tried to install eukdetect, but when I ran this command: conda env update --name Eukdetect -f environment.yml it gave an error saying: SpecNotFound: Invalid name, try the format: user/package

Can you tell what the problem is and how it can be resolved?

Thanks in advance

allind commented 2 years ago

Thanks for your interest in eukdetect and for reaching out. It looks like this error can happen when environment.yml isn't visible to conda. Can you double check that the environment.yml file is in the directory where you're running this command?
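A quick way to rule this out is to check for the file before calling conda. The snippet below is only a sketch: the helper name `environment_file` is hypothetical and is not part of eukdetect or conda.

```python
from pathlib import Path


def environment_file(directory=".", name="environment.yml"):
    """Return the resolved path of the environment file, or raise if it is missing.

    Hypothetical helper for illustration; not part of eukdetect or conda.
    """
    path = Path(directory) / name
    if not path.is_file():
        raise FileNotFoundError(f"{name} not found in {Path(directory).resolve()}")
    return path.resolve()
```

If the file is found, either run the conda command from that directory or pass the returned path to the -f option.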

saras224 commented 2 years ago

Hi @allind, that issue got resolved when I ran the same command inside the directory where I had built the database. I was able to run the analysis for all the samples, and I got two final text files for each sample. My aim is to separate out just the fungal species from the text file (A1_filtered_hits_taxonomy.txt), which I can do. Next, I want to plot graphs based on abundance, but I do not see any abundance information in the result file. Can you suggest tools or software for comparative analysis of all the samples based on abundance? And can I treat the read counts for each species as their abundance?

Thanks Saraswati Awasthi

saras224 commented 2 years ago

Hi @allind Please help me in calculating the relative abundance of the species detected by eukdetect.

Thanks Saraswati

allind commented 2 years ago

Eukdetect doesn't calculate relative abundance of taxa. One reason for this is that most metagenomic sequencing libraries are primarily composed of bacterial reads, and since eukdetect only considers eukaryotes, calculating relative abundance of eukaryotic taxa has the potential to be misleading. However, this is something we could consider revisiting.

If what you're interested in is how the abundance of a single taxon changes across samples, the read counts reported in the output files, normalized by the sequencing library size (so the read count divided by the total number of reads), would be a good metric. If that's not what you're trying to do, let me know.
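As a minimal sketch of that normalization (the function name and all numbers below are made up for illustration, and the total read count per sample has to come from your own sequencing data):

```python
def normalized_read_counts(read_counts, total_reads):
    """Divide each taxon's read count by the library size (total reads in the sample)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {taxon: count / total_reads for taxon, count in read_counts.items()}


# Hypothetical read counts for two samples with different library sizes.
sample_a = normalized_read_counts({"Species X": 100, "Species Y": 60}, total_reads=10_000_000)
sample_b = normalized_read_counts({"Species X": 100}, total_reads=20_000_000)
```

Here the same raw count of 100 reads gives a value in sample_b that is half the value in sample_a, because sample_b's library is twice as large.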

saras224 commented 2 years ago

[Image: fungus_plot]

I want to make a plot that looks something like this, comparing different species across different samples.

Also, different samples have different read lengths, so can you change the script so that it handles samples with different read lengths in a loop?

Thanks Saraswati

allind commented 2 years ago

If this is something that you want to create, you can estimate the relative abundance from the "Total_marker_coverage" number that eukdetect reports in the table output. This number is the percentage of observed marker gene sequence for a given taxon that has one or more aligned reads. In most metagenomic sequencing libraries, this number will fall well short of 100%. If this is the case for your samples, you can estimate the relative abundance of fungi by pulling out the fungal species from the table file and dividing the Total_marker_coverage for each taxon by the sum of all marker coverages (i.e., divide each row by the column sum). I have not tested this extensively, so make sure that these relative abundance numbers also make sense given how many reads you see aligning to each species.

I would urge some caution in interpreting these kinds of results when you're working with the eukaryotic fraction of the library only. If you're interested in how eukaryotic species vary between samples, I would focus more on how the amount of each individual species varies between samples than on how the community composition varies, as that will be more robust with these data.
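The "divide each row by the column sum" step could look roughly like the sketch below. It assumes the fungal rows have already been pulled out of the table and loaded as dictionaries; the "Name" key, the species labels, and the coverage values are hypothetical, and only the Total_marker_coverage column name comes from the description above.

```python
def relative_abundance(rows, column="Total_marker_coverage"):
    """Divide each row's marker coverage by the sum of the column.

    Values may be numbers or strings; a trailing '%' is stripped in case
    the table stores percentages as text (an assumption).
    """
    values = [float(str(row[column]).rstrip("%")) for row in rows]
    total = sum(values)
    if total == 0:
        raise ValueError("column sum is zero; no coverage to normalize")
    return [value / total for value in values]


# Hypothetical fungal rows from one sample's table output.
fungal_rows = [
    {"Name": "Fungus A", "Total_marker_coverage": "30.0%"},
    {"Name": "Fungus B", "Total_marker_coverage": 10.0},
]
fractions = relative_abundance(fungal_rows)  # 0.75 and 0.25 for these values
```

The resulting fractions sum to 1 within each sample, so they describe composition within the fungal rows only, not abundance relative to the whole library.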

saras224 commented 2 years ago

Hi @allind

  1. Can I use the read counts directly as the abundance? I have the same number of reads in all the samples, so dividing each species' read count by the total read count of its sample divides everything by the same number, and the proportions stay the same.

  2. Suppose species X has 100 reads in sample A and species Y has 60 reads in the same sample. Can we say that species X is more abundant than species Y in sample A?

  3. You also said it would be more meaningful to plot how a species varies across the different samples, but which parameter should I plot: read counts or total marker coverage? If I plot total marker coverage as you suggested earlier, the sum of the total marker coverage column differs between samples, so as I understand it I cannot use it to indicate abundance across samples.

  4. One more question: the table has two columns, one is the percentage observed marker and the other is total marker percentage. What does each indicate? The names are quite confusing.

Thanks

allind commented 2 years ago

Hi, there's a new release of eukdetect that may help you with these questions. The new release calculates a proxy for relative abundance called "eukaryotic fraction". Since eukaryotes usually make up very small amounts of microbiome sequencing libraries, this number should also be considered alongside the proxy for absolute abundance, which is reported by eukdetect as "reads per kilobase of sequence" or "RPKS". Please reach out with any questions.
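For orientation, a generic reads-per-kilobase calculation is sketched below. This is an assumption based only on the name of the metric: the thread does not say which sequence length eukdetect uses or how it computes RPKS, so treat the function and its arguments as hypothetical and rely on the values eukdetect itself reports.

```python
def reads_per_kilobase(read_count, sequence_length_bp):
    """Generic reads-per-kilobase: reads divided by sequence length in kilobases.

    Illustration only; eukdetect's own RPKS calculation may differ.
    """
    if sequence_length_bp <= 0:
        raise ValueError("sequence_length_bp must be positive")
    return read_count / (sequence_length_bp / 1000)


# Hypothetical example: 100 reads over 50,000 bp of sequence.
example_rpks = reads_per_kilobase(100, 50_000)
```

Dividing by length keeps taxa with more or longer marker sequences from looking more abundant simply because there is more sequence for reads to align to.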