MaestSi / MetONTIIME

A Meta-barcoding pipeline for analysing ONT data in QIIME2 framework
GNU General Public License v3.0
76 stars 17 forks source link

sampling_depth number? #12

Closed sekhwal closed 4 years ago

sekhwal commented 4 years ago

Hi, I am running Evaluate_diversity.sh but I am not sure about -d . Here is my reads information. Please let me know which number should I give for -d here. In logfile.txt it is showing "Number of basecalled reads: 85256".

Number of reads assigned to BC01: 18 Number of reads assigned to BC02: 11 Number of reads assigned to BC03: 39 Number of reads assigned to BC04: 25 Number of reads assigned to BC05: 13 Number of reads assigned to BC06: 33 Number of reads assigned to BC07: 4 Number of reads assigned to BC08: 26043 Number of reads assigned to BC09: 23447 Number of reads assigned to BC10: 26536 Number of reads assigned to BC11: 27 Number of reads assigned to BC12: 62

MaestSi commented 4 years ago

Hi, personally I would delete all samples except BC08, BC09 and BC10, because the comparison has to be performed starting with the same number of reads. I would try using 1000 and have a look at the PCA, especially at weighted_unifrac_emperor.qzv file. You can try with whatever number you want up to the minimum number of reads from a sample (e.g. 23000), but it is very time consuming, and I don't think it is worthwhile to use so many reads, since they have high error rate.

sekhwal commented 4 years ago

As we have discussed previously --that I can delete files other than BC08, 09 and 10 now using delete BC.fastq.gz. Do I need to perform this delete command in ../analysis directory that is produced after running Launch_MinION_mobile_lab.sh and then I should go for Evaluate_diversity.sh with -d 1000?

Is that correct?

MaestSi commented 4 years ago

Maybe don't delete them, but just create another folder and move them there. The important is that they are not anymore in analysis folder. Simone

sekhwal commented 4 years ago

Sounds good!

sekhwal commented 4 years ago

Hi,

I have got following output files from Evaluate_diversity.sh command. Now which file should I take to find differences between the communities in bar graph? Do I need to use QIIME to get bar graph or any other methods at MetONTIIME.


Picture1

MaestSi commented 4 years ago

Hi, you can have a look at weighted_unifrac_emperor.qzv, to see if your samples cluster following a certain pattern. Other analyses are outside of the scope of this pipeline, as I am not expert about this and I would not be able to give you proper advice. Please ask it in the QIIME2 forum, and if you find out anything interesting for your aim, please let me know! Simone

sekhwal commented 4 years ago

Hi, I came to know how to visualize output files of Evaluate_diversity.sh. Here, are some commands-block and more description in these tutorials. Therefore, I need to know, can these command-blocks run on MetONTIIME pipeline or do I need to run these commands on QIIME2 only.

http://compbio.ucsd.edu/wp-content/uploads/2018/07/20180621_oslo_university_microbiome_analysis_with_qiime2_tutorial.pdf

https://github.com/qiime2/docs/blob/master/source/tutorials/moving-pictures.rst


.. command-block::

qiime diversity beta-group-significance \ --i-distance-matrix core-metrics-results/unweighted_unifrac_distance_matrix.qza \ --m-metadata-file sample-metadata.tsv \ --m-metadata-column body-site \ --o-visualization core-metrics-results/unweighted-unifrac-body-site-significance.qzv \ --p-pairwise

qiime diversity beta-group-significance \ --i-distance-matrix core-metrics-results/unweighted_unifrac_distance_matrix.qza \ --m-metadata-file sample-metadata.tsv \ --m-metadata-column subject \ --o-visualization core-metrics-results/unweighted-unifrac-subject-group-significance.qzv \ --p-pairwise

MaestSi commented 4 years ago

Hi, these are QIIME2 instructions that can be run outside of MetONTIIME pipeline.

sekhwal commented 4 years ago

Hi, I want to run Evaluate_diversity.sh with only high reads samples (BC08, BC09, BC10), but now if I remove other samples the pipeline unable to produce results because it needs rep-seqs.qza and table.qza which contain all samples' information. Is any way that I can RUN Evaluate_diversity.sh command to generate out-put files or do I need to RUN pipeline from starting?

MaestSi commented 4 years ago

Are you saying that if you just remove BC*.fastq.gz files for the samples that you don't want, and run Evaluate_diversity.sh script you get an error? Or you also moved rep-seqs.qza?

sekhwal commented 4 years ago

I am just moving BC*.fastq.gz in different directory.

sekhwal commented 4 years ago

But this command needs rep-seqs.qza and table.qza to generate output files. Right?

MaestSi commented 4 years ago

More probably the error is due to manifest_1000_subsampled.txt, which you probably have not removed, after moving those samples. Therefore, the script is looking for them without finding them. Please try removing that file, and rerun the script.

sekhwal commented 4 years ago

And also I think I need to move sequence-metadata.tsv in the /analysis/ directory because otherwise it generates 1000_subsampled.txt files where the sequence-metadata.tsv located and then the pipeline shows the error.

MaestSi commented 4 years ago

You should specify with -m parameter the original sample-metadata.tsv file.

sekhwal commented 4 years ago

Hi, Its generate manifest_1000_subsampled.txt after running Evaluate_diversity.sh, therefore its showing error. Here are the generate files... Picture1

sekhwal commented 4 years ago

Still its generate all samples' files..

MaestSi commented 4 years ago

Because the directory fastq-files-backup that you generated is still inside the analysis dir. The script doesn't look just at the present folder, but also in subfolders. So, move that directory outside of it and it should work!

sekhwal commented 4 years ago

Do I need to change in sequence-metadata.tsv and manifest.txt as only three samples (BC08, BC09, BC10) as well?

MaestSi commented 4 years ago

No, you should not need to do that. Just remove the previously created manifest_1000_subsampled.txt file.

sekhwal commented 4 years ago

It works. However it is showing following errors.


(1/2) Invalid value for "--i-table": '/home/kumarm/MinION-data/MetONTIIME- done2/fast5_pass_analysis/analysis/table_1000_subsampled.qza' is not a valid filepath (2/2) Invalid value for "--i-phylogeny": '/home/kumarm/MinION- data/MetONTIIME-done2/fast5_pass_analysis/analysis/rooted- tree_1000_subsampled.qza' is not a valid filepath Picture1

sekhwal commented 4 years ago

I think it is generating files (table_1000_subsampled.qza etc.) out of /analysis directory since sequence-metadata.tsv is there..

MaestSi commented 4 years ago

Maybe you could try either moving the metadata file inside analysis directory, and see if it works. Sorry if I can't test it myself, but I'm quite busy at the moment.

sekhwal commented 4 years ago

I think eliminating other samples would not work for making groups in qiime. After getting output files of Evaluate_diversity.sh, I am trying to run qiime to calculate beta and alpha diversity. However, with only three (BC08, BC09, BC10) it is showing following error. I think it needs all multiple samples to make a bar plots...

Any suggestion?


Plugin error from diversity:

All values in the grouping vector are the same. This method cannot operate on a grouping vector with only a single group of objects (e.g., there are no ‘between’ distances because there is only a single group).

Debug info has been saved to /tmp/qiime2-q2cli-err-lh96uz64.log


Metadata file BC08 group1 BC09 group2 BC010 group3

MaestSi commented 4 years ago

Hi, I just made a test. I can confirm that the Evaluate_diversity.sh should work if in the working directory (specified by -w parameter) there are the fastq.gz files for the samples you want to analyse, and if you specify (with -m parameter) the sample-metadata.tsv file created by MetONTIIME.sh script, without having to create another one with only part of the samples. It may be that you are not succeeding to do that because you are using a sample-metadata.tsv file created by yourself, that maybe is not compliant. If this is the case, have you validated it with Keemei?

sekhwal commented 4 years ago

You mean, did you test qiime with output files of Evaluate_diversity.sh in order to make box plots?

I think, I do not need to remove other low reads samples if I need to calculate beta-group-significance visualizer since it performs a statistical test comparing groups with multiple replicates in a group. However, the problem is that I can only able to get beta-group-significance bar plot if I put two high read samples in a group either BC08 and BC09 (group 1) in a group or BC09 and BC10 (group3).

----------------------------- metadata file----------- sample-id body-site

q2:types categorical

BC08 group1 BC07 group1 BC03 group1 BC01 group1 BC04 group2 BC12 group2 BC11 group2 BC06 group2 BC09 group3 BC02 group3 BC10 group3 BC05 group3

Picture1

-------------command---------

qiime diversity beta-group-significance –i-distance-matrix core-metrics-results/unweighted_unifrac_distance_matrix.qza –m-metadata-file sample-metadata.tsv –m-metadata-column body-site –o-visualization core-metrics-results/unweighted-unifrac-body-site-significance.qzv –p-pairwise

MaestSi commented 4 years ago

No, I tested that the script was working, without giving the errors you reported. In my opinion results are meaningless if you don't use the same number of reads for each sample. In fact, subsampling a different number of reads for each sample, you are artificially inflating diversity for that sample.

sekhwal commented 4 years ago

Yes! Now I have weighted_unifrac_distance_matrix, weighted_unifrac_emperor etc. files. Can I open in excel or any other software and see the distance between the groups? Picture1

Thanks..

MaestSi commented 4 years ago

I think you can either have a look at qza files for seeing the diversity value for each sample, or you can run some commands similar to the ones you were running, and try to see if some groups of samples have different diversity values.

sekhwal commented 4 years ago

Ok! Thanks for your prompt responses.

sekhwal commented 4 years ago

Hi, Now, I can open and see distances in files produced from Evaluate_diversity.sh command. However, if I open bray_curtis_distance_matrix.qza, I see following distance, but that result is not appropriate. Any idea what went on with that calculation? There are numerous shared taxa, so the distance shouldn’t be 1.0, I don’t think. Do you think, is it because I am selecting -d 1000 reads while running Evaluate_diversity.sh.

----------bray_curtis_distance_matrix.qza------------- BC08 BC09 BC10 BC08 0.0 1.0 1.0 BC09 1.0 0.0 1.0 BC10 1.0 1.0 0.0

MaestSi commented 4 years ago

Hi, in the Evaluate_diversity.sh script that you are using, each read is considered a representative sequence. Since it is highly unlikely that 2 reads from 2 different samples (but also from the same sample) are identical all over their length due to sequencing errors, the beta-diversity metrics, as Bray-Curtis, resulted in maximum values. I now uploaded a new version of Evaluate_diversity.sh script that allows the user to specify a clustering threshold and performs clustering before computing diversity metrics. In this way you should be able to get more reliable metrics. For example, you could try 0.8 as a clustering threshold. Let me know if you check it out. Simone

sekhwal commented 4 years ago

Sure! So, do I need to take -d 1000 for this time also or should I increase reads?

MaestSi commented 4 years ago

Maybe it would be better to increase this number; in principle you would get the best results using 20000, to keep the same number of reads for all your "good" samples. As this may be very time consuming, you could consider to reduce this number to 5000 to do a preliminary test, and then, if results seem reasonable, increase this number to 20000.

sekhwal commented 4 years ago

Ok! Also, I would like to mention that when I RUN pipeline it produces files including table_1000_subsampled.qza, sequences_1000_subsampled.qza and rep-seqs_1000_subsampled.qza in a directory where my sequence-metadata.tsv file locates (/home/kumarm/MinION-data/MetONTIIME-done2), so the pipeline shows some ERROR because it unable to locate these in /analysis directory. However, I have to move these files in /analysis directory and re-run pipeline to produces matrix files.


MaestSi commented 4 years ago

Ok, I pushed a commit that should solve your issue. I didn't experience that issue because in my tests my current directory was always the working directory specified with -w.

sekhwal commented 4 years ago

Hi, Another way can be implemented for the matrix issue that why the Bray Curtis index would be calculated on sequences rather than on the genus-level OTU matrix. It’s easy to run the beta-diversity distance calculation on a feature table, why not do that for both the UniFrac and the Bray Curtis? We can get a more accurate result that way, because the sequences will have already been binned at the relevant taxonomic level. Please let me know if it make sense.

MaestSi commented 4 years ago

That is an option for sure. I didn't do it yet because it requires to retrieve the "error-free" sequence corresponding to the best hit in the database, and that may be not trivial. I will consider having a look into that.

sekhwal commented 4 years ago

Hi, Can you indicate which is the feature table in the output files of the pipeline that containing the samples over which beta diversity are computed. I am looking a file similar to that taxon-table-1.qza.

MaestSi commented 4 years ago

Beta-diversity is computed on rep-seqs_[sampling depth]_subsampled.qza. Remember that diversity metrics are computed independently from taxonomy assignment. Then, if two samples have low beta diversity, you probably can expect them to have also some shared taxa, but it is a different analysis. I would suggest you to have a look at QIIME2’s Moving pictures tutorial. I tried to use filenames as similar as possible, so you should be able to find a relationship.

sekhwal commented 4 years ago

I tried Evaluate_diversity.sh with -d 5000 and got following output in Bray Curtis matrix. Therefore, I am running the script with -d 20000.

  BC08 BC09 BC10
BC08 0 0.6986 0.5416
BC09 0.6986 0 0.6162
BC10 0.5416 0.6162 0
sekhwal commented 4 years ago

Hi, If I use rep-seqs_20000_subsampled.qza for the following qiime command, it is showing error. I am not sure which is the output file from pipeline that can be used as a feature table for the following command.

qiime diversity beta --i-table /home/kumarm/MinION-data/qiime/rep-seqs_20000_subsampled.qza --p-metric braycurtis --p-n-jobs 3 --output-dir /home/kumarm/MinION-data/qiime/betadiversity.csv

-------------ERROR--------- (1/1) Invalid value for "--i-table": Expected an artifact of at least type FeatureTable[Frequency]. An artifact of type FeatureData[Sequence] was provided.

sekhwal commented 4 years ago

However, I RUN pipeline with -d 20000 option and got following bray-curtis_distance-matrix result. Do you think these results make sense?


BC08    BC09    BC10

BC08 0.0 0.67475 0.53295 BC09 0.67475 0.0 0.58635 BC10 0.53295 0.58635 0.0

MaestSi commented 4 years ago

Hi, the only thing you understand from this analysis would be that BC08 sample is more similar to BC10 than to BC09, and that BC09 sample is more similar to BC10 than BC08. Since I don't know anything about your samples I can't say if these results make sense...

sekhwal commented 4 years ago

Ok! Thank you for your clarification.

sekhwal commented 4 years ago

Hi, Do you have idea why Bray Curtis distance matrix calculated from sequences very high compared to UniFrac.

---------Bray Curtis------------

  BC08 BC09 BC10
BC08 0 0.67475 0.53295
BC09 0.67475 0 0.58635
BC10 0.53295 0.58635 0

-------UniFrac----------

  BC08 BC09 BC10
BC08 0 0.410938 0.331465
BC09 0.410938 0 0.351139
BC10 0.331465 0.351139 0
MaestSi commented 4 years ago

Hi, I only know that UniFrac incorporates phylogenetic relationships between the features, so I hypothesize that sequences might end up in different clusters due to biological diversity (good) and due to sequencing errors (bad); but then, if the software has a look at those clusters and tries to figure out how much they differ through the incorporation of phylogenetic distance, then it finds out that in many cases those clusters are actually similar, and so Bray Curtis was inflated. Mind that I'm by no means expert about this topic, so I'd encourage you to ask for advice in QIIME2 forum.