Knowledge-Graph-Hub / kg-microbe

https://knowledge-graph-hub.github.io/kg-microbe/index.html
BSD 3-Clause "New" or "Revised" License

ingest gene knockout data from LBL microbial fitness experiments #9

Open realmarcin opened 3 years ago

realmarcin commented 3 years ago

All of the data is here (84G total): http://genomics.lbl.gov/supplemental/bigfit/

The numerical relative growth data would have to be converted to a binary call (growth vs. no growth), e.g. via thresholding.
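A minimal sketch of what such a thresholding step might look like. The function name `call_growth` and the cutoff of -1.0 are assumptions for illustration only; the thread does not specify a threshold, and a real ingest would need to choose one deliberately.

```python
def call_growth(log_ratio, threshold=-1.0):
    """Convert a relative fitness value into a binary growth call.

    threshold is a hypothetical cutoff (not taken from the dataset):
    values at or below it are treated as 'no_growth'.
    Missing values (None) are passed through unchanged.
    """
    if log_ratio is None:
        return None
    return "no_growth" if log_ratio <= threshold else "growth"


# Placeholder values, not real measurements.
calls = [call_growth(v) for v in (-2.5, -0.2, 0.4, None)]
```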

Just taking the first organism as an example: http://genomics.lbl.gov/supplemental/bigfit/html/acidovorax_3H11/

On the organism page, under 'Genes', the 'Specific phenotypes' link gives a table of the most significant phenotype per gene for this KO dataset: http://genomics.lbl.gov/supplemental/bigfit/html/acidovorax_3H11/specific_phenotypes This file can serve as the primary data source. These columns:

provide the following data (column: meaning):

sysName: gene name
desc: description
name: internal name
lrn: log ratio normalized
t: t-statistic
Group: condition group
Condition_1: condition name
Concentration_1: concentration
Units_1: unit
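The column mapping above could be applied while reading the file, roughly as sketched here. This assumes the file is tab-delimited with a header row and that missing values appear as `NA`; the helper names and the sample row are made-up placeholders, not real data.

```python
import csv
import io

# Source column -> descriptive label, following the mapping given in the issue.
COLUMN_LABELS = {
    "sysName": "gene_name",
    "desc": "description",
    "name": "internal_name",
    "lrn": "log_ratio_normalized",
    "t": "t_statistic",
    "Group": "condition_group",
    "Condition_1": "condition_name",
    "Concentration_1": "concentration",
    "Units_1": "unit",
}


def to_float(text):
    """Parse a numeric field; return None for empty or 'NA' values."""
    if text is None or text.strip() in ("", "NA"):
        return None
    return float(text)


def read_specific_phenotypes(handle):
    """Read a tab-delimited phenotype table into a list of relabeled dicts."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        record = {label: row.get(col, "") for col, label in COLUMN_LABELS.items()}
        for key in ("log_ratio_normalized", "t_statistic"):
            record[key] = to_float(record[key])
        records.append(record)
    return records


# Synthetic example row (placeholder values only).
SAMPLE = (
    "sysName\tdesc\tname\tlrn\tt\tGroup\tCondition_1\tConcentration_1\tUnits_1\n"
    "GENE_A\texample protein\tNA\t-2.1\t-6.3\tstress\texample compound\t0.5\tmM\n"
)
records = read_specific_phenotypes(io.StringIO(SAMPLE))
```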

For reference, under 'Genes' the 'Gene fitness' link gives a full table of relative fitness values: http://genomics.lbl.gov/supplemental/bigfit/html/acidovorax_3H11/fit_logratios_good.tab The row labels ('locusId') are gene ids, and the column labels are condition (sample) ids, each including a text description.
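For downstream use, a gene-by-condition matrix like this is often easier to handle in long form (one record per gene-condition pair). The sketch below assumes a tab-delimited file whose first column is `locusId` and whose remaining headers look like `<condition id> <text description>`; the real file may carry additional annotation columns, which would need to be listed in `id_columns`. Sample values are placeholders.

```python
import csv
import io


def melt_fitness(handle, id_columns=("locusId",)):
    """Yield one record per (gene, condition) cell of a wide fitness table."""
    for row in csv.DictReader(handle, delimiter="\t"):
        for header, value in row.items():
            if header in id_columns:
                continue
            # Assumed header layout: condition id, a space, then a description.
            cond_id, _, cond_desc = header.partition(" ")
            try:
                fitness = float(value)
            except (TypeError, ValueError):
                fitness = None  # e.g. 'NA' or a missing cell
            yield {
                "locusId": row["locusId"],
                "condition_id": cond_id,
                "condition_desc": cond_desc,
                "fitness": fitness,
            }


# Synthetic two-condition example (placeholder ids and values).
SAMPLE = (
    "locusId\tcondA example condition\tcondB other condition\n"
    "1001\t-1.5\t0.2\n"
    "1002\tNA\t0.0\n"
)
long_records = list(melt_fitness(io.StringIO(SAMPLE)))
```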

There is additional data on each condition on the organism page under 'Tables' then 'Experiments' then 'Detailed metadata for experiments': http://genomics.lbl.gov/supplemental/bigfit/html/acidovorax_3H11/expsUsed

A basic ingest of this data would model it either as mutant alleles or as a gene-condition relation indicating that gene X is essential for growth in condition Y. As key supporting data, the gene annotations should also be ingested: http://genomics.lbl.gov/supplemental/bigfit/html/acidovorax_3H11/fit_genes.tab with the caveat that these are free-text annotations and so may require standardization.
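As a rough illustration of the gene-condition relation, each qualifying measurement could become a subject/predicate/object edge. The predicate name `essential_for_growth_in`, the default cutoff, and the field names are hypothetical; the thread does not settle on a schema or vocabulary.

```python
def build_edges(records, is_essential=lambda log_ratio: log_ratio <= -1.0):
    """Turn (gene, condition, log_ratio) tuples into gene-condition edges.

    is_essential is a caller-supplied rule; the default cutoff is a
    placeholder assumption, not a value taken from the dataset.
    """
    edges = []
    for gene, condition, log_ratio in records:
        if log_ratio is not None and is_essential(log_ratio):
            edges.append(
                {
                    "subject": gene,
                    "predicate": "essential_for_growth_in",  # hypothetical label
                    "object": condition,
                    "log_ratio": log_ratio,
                }
            )
    return edges


# Placeholder inputs, not real measurements.
edges = build_edges(
    [
        ("GENE_A", "condition_1", -2.1),
        ("GENE_B", "condition_1", 0.3),
        ("GENE_C", "condition_2", None),
    ]
)
```

Keeping the rule as a parameter lets the thresholding choice be revisited without touching the edge-building code.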

Further ingests could include:

cmungall commented 3 years ago

> As key supporting data the gene annotations should also be ingested: http://genomics.lbl.gov/supplemental/bigfit/html/acidovorax_3H11/fit_genes.tab with the caveat that these are 'free text' annotations so may require standardization.

Can we not just get the annotations from UniProt? The challenge here is that there is no shared ID between the files.

This line

Ac3H11_2265 NA 1 scaffold5/16 499203 500063 + NA FIG146518: Zn-dependent hydrolases, including glyoxylases 0.6574 7 TRUE

May correspond to https://www.uniprot.org/uniprot/A0A165JRD7 ?

Do we have the AA sequences easily accessible?
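If amino-acid sequences were available for both sides, one possible way to bridge the missing shared ID would be exact sequence matching, sketched below. This is an assumption about a workable approach, not something the thread confirms; all ids and sequences are placeholders, and near-identical sequences would need an alignment-based method instead.

```python
import hashlib


def seq_key(seq):
    """Normalize a protein sequence and return a digest usable as a join key."""
    # Drop whitespace, uppercase, and strip a trailing stop marker if present.
    cleaned = "".join(seq.split()).upper().rstrip("*")
    return hashlib.sha256(cleaned.encode("ascii")).hexdigest()


def match_by_sequence(local, reference):
    """Map each local gene id to reference ids with an identical sequence."""
    ref_index = {}
    for ref_id, seq in reference.items():
        ref_index.setdefault(seq_key(seq), []).append(ref_id)
    return {
        local_id: ref_index.get(seq_key(seq), [])
        for local_id, seq in local.items()
    }


# Placeholder ids and sequences for illustration only.
local = {"LOCAL_1": "MKT AYI\n", "LOCAL_2": "MAAA"}
reference = {"REF_A": "mktayi*", "REF_B": "MGGG"}
matches = match_by_sequence(local, reference)
```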