NP-Omix / BioCompass


Some possible enhancements after publishing first version #16

Open tiagolbiotech opened 8 years ago

tiagolbiotech commented 8 years ago

Before selecting the cutoffs

Modifications for the current code:

  1. Add hits within the same genome and remove duplicates (e.g. the same BGC from NCBI + personal database);
  2. Improve subclustering (matrix rules?) in order to remove multiple self loops;
  3. Refine cluster borders by removing unique hypothetical genes;

    New additions to the code:

  4. Calculate the average Jaccard index for domains between all gene clusters that are in the network, creating a second similarity matrix. Then use DBSCAN to separate the gene clusters into groups;
  5. Make the network output an interactive chart (just like Numbers does), named the calibration graph, showing how the networks change across a range of cutoffs, highlighting the family of "gold standard BGCs" (just like an "internal standard") and using the groups from the second DBSCAN to color nodes; PS: only include edges for biosynthetic or hypothetical (uncolored) genes.
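
A minimal sketch of item 4, assuming each gene cluster can be reduced to a set of domain names. The `bgc_domains` dictionary, the domain labels, and the `eps`/`min_samples` values are made up for illustration and are not taken from BioCompass. It computes a plain pairwise Jaccard index on the domain sets (the per-gene averaging is not shown), turns it into a distance matrix, and passes that to scikit-learn's DBSCAN as a precomputed metric.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import DBSCAN


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical input: one set of domain names per gene cluster.
bgc_domains = {
    "BGC_A": {"PKS_KS", "PKS_AT", "ACP"},
    "BGC_B": {"PKS_KS", "PKS_AT", "ACP", "TE"},
    "BGC_C": {"Condensation", "AMP-binding", "PCP"},
    "BGC_D": {"Condensation", "AMP-binding", "PCP", "TE"},
}

names = list(bgc_domains)
n = len(names)

# Second similarity matrix: pairwise Jaccard index on domain sets.
similarity = np.eye(n)
for i, j in combinations(range(n), 2):
    s = jaccard(bgc_domains[names[i]], bgc_domains[names[j]])
    similarity[i, j] = similarity[j, i] = s

# DBSCAN works on distances, so convert similarity to 1 - similarity.
distance = 1.0 - similarity

# eps and min_samples are placeholder values, not tuned cutoffs.
labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(distance)
groups = dict(zip(names, labels.tolist()))
print(groups)
```

The resulting group labels could then be used to color nodes in the calibration graph; a label of -1 marks clusters that DBSCAN treats as noise.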

    After selecting the cutoffs

  6. Add (a better) filtering script, where the user specifies the best cutoff found using this calibration graph;
  7. Automatically generate output images (one with and one without regulatory/mobile/resistance genes) for the selected network (using NetworkX?), but also provide Cytoscape output;
  8. Add multiple gene alignment images upon clicking a family in the output;
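
A rough sketch of how items 6 and 7 might fit together, assuming the network is available as `(source, target, score)` tuples. That edge format, the function names, and the example scores are hypothetical. It filters edges by a user-chosen cutoff, builds a NetworkX graph, and writes GraphML, which is one of the formats Cytoscape can import. Image rendering is left out.

```python
import os
import tempfile

import networkx as nx


def filter_edges(edges, cutoff):
    """Keep only edges whose similarity score is at or above the cutoff."""
    return [(u, v, s) for u, v, s in edges if s >= cutoff]


def build_network(edges):
    """Build an undirected graph, storing each score as the edge weight."""
    graph = nx.Graph()
    for u, v, s in edges:
        graph.add_edge(u, v, weight=s)
    return graph


# Hypothetical similarity edges between gene clusters.
edges = [
    ("BGC_A", "BGC_B", 0.75),
    ("BGC_B", "BGC_D", 0.14),
    ("BGC_C", "BGC_D", 0.75),
]

# The cutoff would be the value picked from the calibration graph.
graph = build_network(filter_edges(edges, cutoff=0.5))

# GraphML output for Cytoscape, written to a temporary directory here.
out_path = os.path.join(tempfile.mkdtemp(), "network.graphml")
nx.write_graphml(graph, out_path)
print(graph.number_of_nodes(), graph.number_of_edges(), out_path)
```

Producing the two image variants would amount to calling the same filter on edge lists with and without the regulatory/mobile/resistance genes before drawing.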

    For future

  9. Run analysis on multiple samples (multiCOMPASS module?). Suggestion: run analysis for the genome with most BGCs, then loop until all BGCs from query are in the final network.
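
The suggested loop could look roughly like the sketch below. The `genomes` mapping and the `run_analysis` callable are placeholders for whatever a multi-sample module would actually provide; here `run_analysis` simply returns the seed genome's own BGCs. At each step it seeds the analysis with the genome that has the most BGCs not yet in the network, and stops once every query BGC is covered or no further progress is possible.

```python
def run_multi_sample(genomes, run_analysis):
    """Greedy loop: seed with the genome holding the most uncovered BGCs.

    genomes: dict mapping genome name -> set of BGC identifiers.
    run_analysis: callable taking a genome name and returning the set of
        BGCs that end up in the network for that run (hypothetical).
    Returns the order of seed genomes and any BGCs left uncovered.
    """
    remaining = set().union(*genomes.values())
    order = []
    while remaining:
        # Genome with the most BGCs still missing from the final network.
        seed = max(genomes, key=lambda g: len(genomes[g] & remaining))
        covered = run_analysis(seed) & remaining
        if not covered:
            break  # no progress; stop instead of looping forever
        order.append(seed)
        remaining -= covered
    return order, remaining


# Hypothetical query: three genomes with partly shared BGCs.
genomes = {
    "g1": {"b1", "b2", "b3"},
    "g2": {"b3", "b4"},
    "g3": {"b5"},
}

# Stand-in analysis: each run only places the seed genome's own BGCs.
order, remaining = run_multi_sample(genomes, lambda seed: genomes[seed])
print(order, remaining)
```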

    Remaining challenges

  10. How to improve subclustering rules?
  11. How to better select the best DBSCAN subclustering iteration?
  12. How to select the best eps for the Jaccard index DBSCAN?
  13. How to run multiple samples with different subclustering?
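
One possible starting point for question 12, offered as a sketch and not as an answer: sweep `eps` over a range and record how many groups and how many noise points DBSCAN returns at each value, then look for a stable plateau. The distance matrix, the `eps` values, and `min_samples=2` below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical Jaccard distance matrix (1 - similarity) for four gene clusters.
distance = np.array([
    [0.00, 0.25, 1.00, 1.00],
    [0.25, 0.00, 1.00, 0.857],
    [1.00, 1.00, 0.00, 0.25],
    [1.00, 0.857, 0.25, 0.00],
])


def sweep_eps(distance, eps_values, min_samples=2):
    """Return (eps, number of groups, number of noise points) for each eps."""
    results = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="precomputed").fit_predict(distance)
        n_groups = len(set(labels.tolist()) - {-1})  # -1 marks noise
        n_noise = int((labels == -1).sum())
        results.append((eps, n_groups, n_noise))
    return results


results = sweep_eps(distance, [0.1, 0.5, 0.9])
for eps, n_groups, n_noise in results:
    print(f"eps={eps}: {n_groups} groups, {n_noise} noise points")
```

The same table could feed the calibration graph, so the eps choice and the network cutoff are inspected side by side.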