how is `strain_mappings.csv` generated?

NPLinker / nplinker

A python framework for data mining microbial natural products by integrating genomics and metabolomics data

https://nplinker.github.io/nplinker

Apache License 2.0


how is `strain_mappings.csv` generated? #148

Closed CunliangGeng closed 1 year ago

CunliangGeng commented 1 year ago

- [x] check code on the generation of the `strain_mappings.csv` file
  - [x] clean up relevant code: loader.py, downloader.py, etc. ( #149 )
- [x] when and how much manual input is needed?

justinjjvanderhooft commented 1 year ago

@CunliangGeng the strain mapping file is normally provided manually by the user; it is the key information that links the genomics data to the metabolomics data. Only when the data are downloaded from the PoDP are these connections loaded into NPLinker automatically. Of course, this step is tricky, as the user may not get the format of the mapping file completely right. Do you have any suggestions on how to make this step less "error-prone" and thus more "robust"?
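To make the "error-prone" part concrete, here is a minimal sketch of a sanity check that could run on a mapping file before it is loaded. The thread does not spell out the file layout, so the sketch assumes that each row holds a strain ID in the first column followed by one or more aliases (for example genome or spectrum file identifiers). The function name `check_strain_mappings` is hypothetical and is not part of NPLinker.

```python
import csv
import io
from collections import defaultdict


def check_strain_mappings(csv_text):
    """Return a list of human-readable problems found in the mapping text.

    Assumed layout (not confirmed by the thread): strain ID, alias, alias, ...
    """
    problems = []
    owners = defaultdict(set)  # alias -> strain IDs that claim it
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # ignore blank lines
        strain_id = cells[0]
        aliases = [cell for cell in cells[1:] if cell]
        if not strain_id:
            problems.append(f"row {row_number}: missing strain ID")
            continue
        if not aliases:
            problems.append(f"row {row_number}: strain {strain_id!r} has no aliases")
        for alias in aliases:
            owners[alias].add(strain_id)
    # An alias claimed by two strains would make the genomics/metabolomics link ambiguous.
    for alias, strains in sorted(owners.items()):
        if len(strains) > 1:
            problems.append(f"alias {alias!r} maps to several strains: {sorted(strains)}")
    return problems
```

An empty list means that none of these particular problems were found; it does not prove the mapping is biologically correct.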

CunliangGeng commented 1 year ago

@justinjjvanderhooft This step is indeed a pain point. I think we could take the following measures to improve it:

1. Treat PODP data as the preferred input of NPLinker. This is the original design logic of NPLinker. We are following the same logic in refactoring the code, while making the PODP pipeline more modular.
2. As for manual input:
   - Providing a guide on generating the mapping file will help a bit, but not much.
   - It would be more helpful to develop a GUI tool (something like cffinit). This tool should run on the user's machine, guide the user step by step through filling in e.g. strain IDs and detecting local files (e.g. BGC, MS), and then output the mapping file. Such a tool would make manual input less "error-prone".
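The core of such a guided tool, without the GUI, might look like the sketch below. It is only an illustration under stated assumptions: the user supplies the strain IDs, a file is assigned to a strain when exactly one strain ID appears in its file name, and the output is a CSV with the strain ID followed by its aliases. The function names, the default suffixes, and the matching rule are all hypothetical, not NPLinker behaviour.

```python
import csv
from pathlib import Path


def draft_mapping(root, strain_ids, suffixes=(".gbk", ".mzML")):
    """Group data files under `root` by the strain ID found in their name.

    Returns (rows, unmatched): rows maps each strain ID to the file stems
    assigned to it; unmatched lists file names the user must resolve by hand.
    """
    files = sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and path.suffix in suffixes
    )
    rows = {strain_id: [] for strain_id in strain_ids}
    unmatched = []
    for path in files:
        hits = [strain_id for strain_id in strain_ids if strain_id in path.name]
        if len(hits) == 1:
            rows[hits[0]].append(path.stem)
        else:
            unmatched.append(path.name)  # no match or ambiguous: ask the user
    return rows, unmatched


def write_mapping(rows, out_path):
    """Write one CSV row per strain: strain ID first, then its aliases."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for strain_id, aliases in rows.items():
            writer.writerow([strain_id, *aliases])
```

Returning the unmatched files separately, instead of guessing, keeps the manual step explicit: a front end would only need to ask the user about those files.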

justinjjvanderhooft commented 1 year ago

Thanks for the suggestions. I agree that PoDP is a great entry point, but in practice many users will start from local files, and may already have BiG-SCAPE results and/or Molecular Networking runs. The GUI tool sounds like a great suggestion - how much work would that be? It might be a nice project for an intern?

CunliangGeng commented 1 year ago

It took experienced engineers more than half a year in total to develop cffinit (see the dev history plot), so I guess the GUI tool would require a similar amount of effort. I think it's a very good internship project.

justinjjvanderhooft commented 1 year ago

Wow, that is quite an effort indeed. Something to consider: if an intern is interested, please do encourage them to take up this challenge, so that we could at least make a start on it. We could reuse bits and pieces of the PoDP add form, since in one of its steps we basically create the mapping file from previously generated information and direct links to the publicly available metabolomics data files.

CunliangGeng commented 1 year ago

This tool should not run in a browser, as the browser will restrict it from detecting files on the user's machine. So I don't think we could reuse the PODP code (a web app running in the browser). The tool would be better as a desktop application with a graphical user interface.