biocore / q2-qemistree

Hierarchical orderings for mass spectrometry data. Canonically pronounced "chemis-tree".
BSD 2-Clause "Simplified" License

Updated codes for Sirius 4.8.2 #150

Closed by helenamrusso 2 years ago

helenamrusso commented 2 years ago

Code adapted to work with Sirius 4.8.2 and its new command-line interface (versions >4.4.29). It seems to be working fine, as the outputs are generated correctly when compared against those from old datasets, but please double-check.

FranckLejzerowicz commented 2 years ago

Hi @helenamrusso I had to install this dev branch of yours to run Qemistree with the new Sirius version. Qemistree worked fine in QIIME, as I could obtain lots of data in the key output files:

3.4G -rw-r--r--  1 flejzerowicz knightlab 3.5G Jul 31 10:14 fingerprints.qza
949M -rw-r--r--  1 flejzerowicz knightlab 1.1G Jul 30 23:31 fragmentation_trees.qza
946M -rw-r--r--  1 flejzerowicz knightlab 1.1G Jul 31 03:25 molecular_formulas.qza

But now, I have an issue with this command:

qiime qemistree make-hierarchy \
--i-csi-results /projects/nutrition/foodomics/qemistree/fingerprints.qza \
--i-feature-tables /projects/nutrition/foodomics/qemistree/FEATURE-BASED-MOLECULAR-NETWORKING-d0797f2a-download_qza_table_data-main.qza \
--o-tree /projects/nutrition/foodomics/qemistree/qemistree.qza \
--o-feature-table /projects/nutrition/foodomics/qemistree/feature-table-hashed.qza \
--o-feature-data /projects/nutrition/foodomics/qemistree/feature-data.qza

The error (below) seems to be related to a temporary directory being deleted too early. I'm not sure whether I should open an issue for this, as it is based on using this non-merged branch.

Traceback (most recent call last):
 File "/home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/q2cli/commands.py", line 328, in __call__
   results = action(**arguments)
 File "</home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/decorator.py:decorator-gen-217>", line 2, in make_hierarchy
 File "/home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
   output_types, provenance)
 File "/home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/qiime2/sdk/action.py", line 383, in _callable_executor_
   output_views = self._callable(**view_args)
 File "/home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_qemistree/_hierarchy.py", line 126, in make_hierarchy
   qc_properties, metric)
 File "/home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_qemistree/_process_fingerprint.py", line 95, in process_csi_results
   collated_fps = collate_fingerprint(csi_result, qc_properties, metric)
 File "/home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_qemistree/_process_fingerprint.py", line 45, in collate_fingerprint
   index_col='relativeIndex', dtype=str, sep='\t')
 File "/home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
   return _read(filepath_or_buffer, kwds)
 File "/home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/pandas/io/parsers.py", line 457, in _read
   parser = TextFileReader(fp_or_buf, **kwds)
 File "/home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
   self._make_engine(self.engine)
 File "/home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
   self._engine = CParserWrapper(self.f, **self.options)
 File "/home/flejzerowicz/usr/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/pandas/io/parsers.py", line 1917, in __init__
   self._reader = parsers.TextReader(src, **kwds)
 File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.__cinit__
 File "pandas/_libs/parsers.pyx", line 689, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'/panfs/panfs1.ucsd.edu/panscratch/flejzerowicz/dnn.qmstr_1536352/qiime2-archive-1vnkrtx5/493f64ca-ee65-4e90-9831-b19dcb729ea0/data/csi-output/fingerprints.csv' does not exist: b'/panfs/panfs1.ucsd.edu/panscratch/flejzerowicz/dnn.qmstr_1536352/qiime2-archive-1vnkrtx5/493f64ca-ee65-4e90-9831-b19dcb729ea0/data/csi-output/fingerprints.csv'

This is weird because temporary files on Panasas were used without issue for the previous qemistree steps. It seems that Qemistree created and then deleted this temporary fingerprints.csv file before using it.

Thanks for any clue! Franck

helenamrusso commented 2 years ago

Hi @FranckLejzerowicz Good to know that you got the output files, as expected! I never had the error you described, but it seems to come from a small detail in the _process_fingerprint.py file. I just realized that I'm using an outdated version of pandas, and it was working fine for me. I talked with @anupriyatripathi and we fixed the issue. Check the "fix pandas loc to reindex" modification; I hope it will work for you now!

Thanks! Helena

FranckLejzerowicz commented 2 years ago

Hi @helenamrusso Somehow, my problem persists and I can't figure out why. I added a couple of prints to check the content of the csi_results and see whether the file reported as missing is indeed missing:

def collate_fingerprint(csi_result: CSIDirFmt, qc_properties: bool = False,
                        metric: str = 'euclidean'):
    '''
    This function collates predicted chemical fingerprints for mass-spec
    features in an experiment.
    '''
    if isinstance(csi_result, CSIDirFmt):
        csi_result = str(csi_result.get_path())
    print(csi_result)
    fpfoldrs = os.listdir(csi_result)
    print(fpfoldrs)

and it shows:

/panfs/panfs1.ucsd.edu/panscratch/flejzerowicz/qiime2-archive-e216o7gj/493f64ca-ee65-4e90-9831-b19dcb729ea0/data/csi-output
['formula_identifications_adducts.tsv', 'canopus_summary_adducts.tsv', 'csi_fingerid.tsv', 'csi_fingerid_neg.tsv', 'formula_identifications.tsv', 'compound_identifications.tsv', 'compound_identifications_adducts.tsv', 'canopus_summary.tsv', 'report.mztab', '0_features_FEATURE_1233', '1_features_FEATURE_5513', '2_features_FEATURE_7501', [...etc...]

Hence, it indeed looks like the later attempt to read the file:

    substructrs = pd.read_csv(os.path.join(csi_result, 'fingerprints.csv'),
                              index_col='relativeIndex', dtype=str, sep='\t')

fails because 'fingerprints.csv' does not exist (it should be in the fpfoldrs list printed above, right?)
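One way to make this check explicit is to test for the file before calling pd.read_csv and to report what the directory actually contains. This is only a minimal sketch: the helper name find_expected_file is hypothetical (it is not part of q2-qemistree), and the demo runs on a throwaway directory rather than on the real csi-output folder.

```python
import os
import tempfile


def find_expected_file(directory, filename='fingerprints.csv'):
    """Return the full path to `filename` in `directory`, or raise a
    FileNotFoundError that lists the files the directory does contain.

    Hypothetical debugging helper; not part of q2-qemistree.
    """
    path = os.path.join(directory, filename)
    if os.path.isfile(path):
        return path
    # Report only top-level regular files to keep the message short
    # (the per-feature subfolders are skipped).
    present = sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    raise FileNotFoundError(
        '%s not found in %s; files present: %s'
        % (filename, directory, present)
    )


# Demonstration on a throwaway directory that mimics the listing above:
# it holds a Sirius summary file but no fingerprints.csv.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, 'csi_fingerid.tsv'), 'w').close()
    try:
        find_expected_file(demo_dir)
    except FileNotFoundError as err:
        message = str(err)

print(message)
```

Calling such a helper right after the os.listdir line would separate "the file was never in the artifact" from "the file was removed before it was read".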

That's weird because it seems like a valid file generated using qemistree:

$ qiime tools peek /projects/nutrition/foodomics/qemistree/fingerprints.qza
UUID:        493f64ca-ee65-4e90-9831-b19dcb729ea0
Type:        CSIFolder
Data format: CSIDirFmt
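
Since a .qza artifact is a zip archive, its member names can also be listed directly, without relying on the temporary extraction directory. That would show whether fingerprints.csv was ever stored in the artifact. The following is a minimal sketch using only the standard library; the function name qza_members_named is hypothetical, and the demo builds a small stand-in archive instead of reading the real fingerprints.qza.

```python
import os
import tempfile
import zipfile


def qza_members_named(qza_path, filename):
    """List archive members whose base name equals `filename`.

    Hypothetical helper; it relies only on a .qza being a zip archive.
    """
    with zipfile.ZipFile(qza_path) as archive:
        return [
            member for member in archive.namelist()
            if member.rsplit('/', 1)[-1] == filename
        ]


# Stand-in archive following the <uuid>/data/csi-output/ layout seen in
# the traceback; the member names are copied from the listing above.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_qza = os.path.join(demo_dir, 'demo.qza')
    with zipfile.ZipFile(demo_qza, 'w') as archive:
        archive.writestr('uuid/data/csi-output/csi_fingerid.tsv', '')
        archive.writestr('uuid/data/csi-output/canopus_summary.tsv', '')
    missing = qza_members_named(demo_qza, 'fingerprints.csv')
    present = qza_members_named(demo_qza, 'csi_fingerid.tsv')

print(missing)
print(present)
```

On the real artifact the call would be qza_members_named('/projects/nutrition/foodomics/qemistree/fingerprints.qza', 'fingerprints.csv'); an empty list would mean the file was never written into the artifact rather than deleted too early.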

Command run:

qiime qemistree make-hierarchy \
    --i-csi-results /projects/nutrition/foodomics/qemistree/fingerprints.qza \
    --i-feature-tables /projects/nutrition/foodomics/qemistree/FEATURE-BASED-MOLECULAR-NETWORKING-d0797f2a-download_qza_table_data-main.qza \
    --o-tree /projects/nutrition/foodomics/qemistree/qemistree.qza \
    --o-feature-table /projects/nutrition/foodomics/qemistree/feature-table-hashed.qza \
    --o-feature-data /projects/nutrition/foodomics/qemistree/feature-data.qza