DessimozLab / OMArk

GNU Lesser General Public License v3.0

KeyError: 'Filename' in plot_all_results.py #21

Closed StepanSaenko closed 4 months ago

StepanSaenko commented 1 year ago

Hello!

I was trying to generate some plots using plot_all_results.py. At first, I ran it this way: /home/saenkos/OMArk/utils/plot_all_results.py -i ./ and everything was fine.

But then I tried to use the -m option with a mapping file and got this error:

Traceback (most recent call last):
  File "/home/saenkos/OMArk/utils/plot_all_results.py", line 250, in <module>
    main_df, cont_df = integrate_external_data(main_df, cont_df, mapping_file, taxonomy_order=arg.taxonomy)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/saenkos/OMArk/utils/plot_all_results.py", line 148, in integrate_external_data
    cont_df['Species name'] = [mapping.get(x,{}).get('Species name',x) for x in cont_df['Filename']]
                                                                                ~~~~~~~^^^^^^^^^^^^
  File "/home/saenkos/anaconda3/envs/myenv/lib/python3.11/site-packages/pandas-2.0.0rc1-py3.11-linux-x86_64.egg/pandas/core/frame.py", line 3745, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/saenkos/anaconda3/envs/myenv/lib/python3.11/site-packages/pandas-2.0.0rc1-py3.11-linux-x86_64.egg/pandas/core/indexes/range.py", line 349, in get_loc
    raise KeyError(key)
KeyError: 'Filename'

I checked the code, and here is my question:

Do line 148, cont_df['Species name'] = [mapping.get(x,{}).get('Species name',x) for x in cont_df['Filename']], and the variable cont_df play any part in generating the plot?

I ask because the plot-generating call takes only the main_df variable: plot_omark_df(main_df, savefile=output_figure)

When I commented that line out, the image was generated, but it was the same as before: using the -m option made no difference.

Could you please provide example plots generated with and without a mapping file?

Thank you.

YanNevers commented 1 year ago

Hello @StepanSaenko !

I am sorry for this issue. I believe this happens when a mapping file is provided and OMArk detects no contaminant in the input proteome. This case is not handled well, but I should be able to correct it quickly. You are right that cont_df, a dataframe reporting the contaminants found in the dataset, is not used to create the plot. Commenting out the line as you did will have no impact on the results.
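The failure mode described above can be sketched as follows. This is only an illustration, not the fix that was eventually committed: it assumes that cont_df is an empty DataFrame without a 'Filename' column when no contaminant is found, and that mapping is a dict keyed by folder name. The helper name add_species_names is hypothetical.

```python
import pandas as pd


def add_species_names(cont_df, mapping):
    """Hypothetical guard around the renaming step from line 148.

    Assumption: cont_df has no 'Filename' column when no contaminant
    was detected, which is what triggers the KeyError.
    """
    if "Filename" not in cont_df.columns:
        # Nothing to rename: return the dataframe unchanged.
        return cont_df
    cont_df = cont_df.copy()
    # Same lookup as the original line: fall back to the raw
    # folder name when it is absent from the mapping.
    cont_df["Species name"] = [
        mapping.get(x, {}).get("Species name", x) for x in cont_df["Filename"]
    ]
    return cont_df


# An empty contaminant table no longer raises KeyError.
mapping = {"UP000000437_7955": {"Species name": "Danio rerio"}}
no_contaminants = add_species_names(pd.DataFrame(), mapping)
```

With this kind of guard, the plotting path (which only uses main_df) is reached whether or not contaminants were detected.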

As an example, here are two plots on a toy dataset, without and with a mapping file: [images: without_mapping, omark_multiple]

If all goes well, only the species name should change in the plot (and the order of species, if you used -t as well). To provide more context, here are the OMArk folders in my directory structure: UP000000437_7955 UP000000589_10090 UP000005640_9606

And here is the mapping file:

   Filename            Species name    TaxId
   UP000000437_7955    Danio rerio     7955
   UP000000589_10090   Mus musculus    10090
   UP000005640_9606    Homo sapiens    9606

If you do not see any change in the new output files after commenting out the line, it may be because of a formatting issue in your mapping file. If you cannot spot the problem, please send me a copy of the file and I will look into it.
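A quick way to check a mapping file for formatting problems is to load it and compare its Filename entries with the OMArk folder names. The sketch below assumes the file is tab-separated with the three columns shown above; the parsing in plot_all_results.py itself may differ, and the variable names here are only illustrative.

```python
import io

import pandas as pd

# Stand-in for the mapping file on disk (assumed tab-separated).
mapping_text = (
    "Filename\tSpecies name\tTaxId\n"
    "UP000000437_7955\tDanio rerio\t7955\n"
    "UP000000589_10090\tMus musculus\t10090\n"
    "UP000005640_9606\tHomo sapiens\t9606\n"
)

# Read everything as strings so TaxIds are not converted to numbers.
table = pd.read_csv(io.StringIO(mapping_text), sep="\t", dtype=str)
table.columns = table.columns.str.strip()

# Build a {folder name: {"Species name": ..., "TaxId": ...}} lookup.
mapping = table.set_index("Filename").to_dict(orient="index")

# Folder names from the example directory structure.
folders = ["UP000000437_7955", "UP000000589_10090", "UP000005640_9606"]

# Any folder listed here would keep its raw name in the plot.
unmatched = [name for name in folders if name not in mapping]
print("Folders without a mapping entry:", unmatched)
```

If unmatched is not empty, or if reading the file yields a single column, the separator or the header line of the mapping file is the likely culprit.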

As for cont_df, we generate it here because the functions of this script are imported in the associated Jupyter Notebook, which provides an interface to explore the data interactively. Having access to the list of contaminants in the whole dataset can be useful in this context.

I should be able to fix this mistake soon, today if possible, and provide a corrected version of this script that does not crash in this situation. In the meantime, feel free to continue using your commented-out version, since it should have no impact on the rest of the script.

StepanSaenko commented 1 year ago

Thank you!

YanNevers commented 10 months ago

This should be fixed as of commit 21f19b6.