bowmanjeffs / paprica

paprica - PAthway PRediction by phylogenetIC plAcement
26 stars 8 forks source link

Error running paprica #78

Closed eperezv closed 3 years ago

eperezv commented 3 years ago

Hello,

I am having problems to properly run paprica in one of my datasets. Do you have any clue about? I get the following error:


generating data for edge Bacteroidetes_939
generating data for edge Bacteroidetes_944
generating data for edge Bacteroidetes_959
generating data for edge Bacteroidetes_960
generating data for edge Bacteroidetes_98
generating data for edge Bacteroidetes_981
generating data for edge Bacteroidetes_984
generating data for edge Bacteroidetes_986
generating data for edge Candidatus_Saccharibacteria_10
Traceback (most recent call last):
  File "/home/eduardo/miniconda3/envs/paprica/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Candidatus_Saccharibacteria_10'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/eduardo/tools/paprica-git/paprica-tally_pathways.py", line 234, in <module>
    edge_data.loc[edge, 'taxon'] = lineages.loc[edge, 'consensus']
  File "/home/eduardo/miniconda3/envs/paprica/lib/python3.7/site-packages/pandas/core/indexing.py", line 873, in __getitem__
    return self._getitem_tuple(key)
  File "/home/eduardo/miniconda3/envs/paprica/lib/python3.7/site-packages/pandas/core/indexing.py", line 1044, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/home/eduardo/miniconda3/envs/paprica/lib/python3.7/site-packages/pandas/core/indexing.py", line 786, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/home/eduardo/miniconda3/envs/paprica/lib/python3.7/site-packages/pandas/core/indexing.py", line 1110, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/home/eduardo/miniconda3/envs/paprica/lib/python3.7/site-packages/pandas/core/indexing.py", line 1059, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/home/eduardo/miniconda3/envs/paprica/lib/python3.7/site-packages/pandas/core/generic.py", line 3488, in xs
    loc = self.index.get_loc(key)
  File "/home/eduardo/miniconda3/envs/paprica/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
    raise KeyError(key) from err
KeyError: 'Candidatus_Saccharibacteria_10'
´´´
bowmanjeffs commented 3 years ago

Hmmmm... I can't reproduce this error in a test file that includes a read that places to the Candidatus_Saccharibacteria reference tree. Make sure that you've cloned the most recent version of the repository, and if the issue persists, send me your analysis file.

eperezv commented 3 years ago

The error is not being produced with the test file but with one of my datasets. The git version was up-to-date and working with the test so I assume it is not related to this.

By try the latest package (not git), I was able to run it properly. I can help to reproduce the issue is you are interested in.

bowmanjeffs commented 3 years ago

The repository has advanced considerably since the last stable release (and I'd strongly recommend using it if we can identify the problem), so I'd like to understand why your file is failing. Are you willing to share either the full fasta or a subset that is sufficient to reproduce the error?

eperezv commented 3 years ago

Yes, sure. I will prepare a subset because the original file is very big. I'll let you know.

eperezv commented 3 years ago

I reproduced it with a small dataset (focused on ASVs from the group it is being problematic). You can download it from here: https://drive.google.com/file/d/1-8RBHHMjjgMwrHptlNyMz7VueUigCu-F/view?usp=sharing

bowmanjeffs commented 3 years ago

Got it, there was an outdated database file. Should be fixed now - subset file runs without error.

eperezv commented 3 years ago

Thanks. It is solved now.

I have a different issue now or I don't know how to deal with now. There are nodes (and seqs) in the XXX.bacteria.combined_16S.bacteria.tax.placements.csv file that does not exist in the XXX.bacteria.edge_data. This was not happening in the previous version. What does it mean?

bowmanjeffs commented 3 years ago

That's a great question and a reminder that I need to get the docs updated. Those are reads that placed to internal nodes on the master (phylum_refs) tree, and thus did not get a final analysis on the phylum trees. As a result we do not attempt an metabolic inference on those reads - they're really distant from RefSeq genomes - but they are accounted for in the unique_seqs.csv file which is the recommended file for any community structure analysis. If it's highly advantageous for them the appear in edge_data (with NA values as appropriate) let me know and can make that change. If this problem seems to be something else definitely let me know.

eperezv commented 3 years ago

Thanks for your answer. I was interested in using paprica for calculating community-weighted means (CWM), that could be helpful to explain patterns in microbial communities. Therefore, I was combining the files to know every OTU/ASV guessed trait. Since this version now exclude some of taxa (probably too far from any known sequenced bacteria), I have to decide whether to remove them from the analyses or other solution.

bowmanjeffs commented 3 years ago

Understood. If you're focusing on traits I recommend removing those taxa. They are so deeply placed that it would not be wise to attempt a metabolic inference. Hopefully they are a relatively minor component of the community. If you find that that is not the case let me know, and I'm happy to take a look at some representative reads to see if we can optimize placement.

eperezv commented 3 years ago

Hello, I am back with this. I will evaluate how important (in terms of rel. abundance) are the taxa that are not included.

In the meanwhile, I am not sure if there is any technical or conceptual decision but for me, it could be very helpful to directly have e.g. the 16S estimation with every OTU (in the same table).

I was also thinking about something. Is there any way of knowing how far the assigned taxa are from their reference genomes? I mean something like the nearest-sequenced taxon index, NSTI (as used on picrust2).

bowmanjeffs commented 3 years ago

Can you describe what you mean by 16S estimation? Abundance, corrected abundance (abundance/16S gene copies), and 16S gene copies are all in the edge_data.csv output file. Also in that file you'll see columns map_ratio and map_id. Those tell you the fraction of positions and total positions shared between the query and reference respectively. The EDPL value is also useful as an indicator of placement certainty.

eperezv commented 3 years ago

I meant the number of copies of the 16S of each OTU/ASV. Yes, the information is in the edge_data.csv, but there, it is not possible to find which OTU matches with every group (that information is on the tax.placements.csv file)

Thanks for letting me know about the map and EDPL variables

bowmanjeffs commented 3 years ago

Ah, yes, because the number of 16S rRNA gene copies is determined at the edge level and not the ASV level you would simply map that edge information to the ASV data (note that ASV abundance has already been normalized). The ASVs are labeled [seq].[edge] or [seq]_[edge] (depending on which version of paprica you're working with) just for this purpose.

eperezv commented 3 years ago

Ok, right. You mentioned that map_ratio and map_id are informative about shared between the query and reference. However, I see in my dataset that many rows are empty. Is that normal?

bowmanjeffs commented 3 years ago

Yes, those are placements to internal nodes (closest estimated genomes in our lingo). For those cases there isn't a definitive reference sequence to measure against, so map_id and map_ratio are not reported.

eperezv commented 3 years ago

Ok.

I have been checking several taxa that were not included in the estimations because they might be too far from the references. This one is highly relevant in my dataset:

>990b1d318316a4d5ee31fd0007f70b97
TACGAAGGGTGCAAGCGTTACTCGGAATTACTGGGCGTAAAGCGTGCGTAGGTGGTTGTTTAAGTCTGTTGTGAAAGCCCTGGGCTCAACCTGGGAACTGCAGTGGAAACTGGACAACTAGAGTGTGGTAGAGGGTAGCGGAATTCCCGGTGTAGCAGTGAAATGCGTAGAGATCGGGAGGAACATCCATGGCGAAGGCAGCTACCTGGACCAACACTGACACTGAGGCACGAAAGCGTGGGGAGCAAACAGG

I performed some exploratory analyses on this dataset with picrust2 some time ago and I've just checked that this taxon had very low NSTI (0.01), so it might have a close fully-sequenced representative.

Of course, I am aware that tools work differently, and that can provide dissimilar results. But I would also like to learn and maybe help to make the tool fit better to this/other datasets.

bowmanjeffs commented 3 years ago

That might be the case, you should view the tree and see where it placed. It's possible that there are very closely related reference sequences for that part of the tree, so placement was still internal, even though this read is very similar.

eperezv commented 3 years ago

Ok. Can you briefly guide me on how to check this? I have been exploring the output files, one by one. The only one that I've found is one ending in ".json". I've opened it, but I feel the placements are not rightly inserted in the tree. Maybe I should check another intermediate file?

bowmanjeffs commented 3 years ago

Use archaeopteryx or an alternate tree viewer on the phyloxml format tree file, or use gappa to explore the jplace file. Also, since this isn't related to a paprica bug let's continue the discussion over email. Just email me directly.