MetabolicAtlas / data-generation

Process the raw data-files for ingestion into the Neo4j database
MIT License

feat: remove code to parse HPA json file that is not needed anymore #23

nanjiangshu closed this 3 years ago

nanjiangshu commented 3 years ago

This PR together with PR 18 and PR 699 closes #124

Since the HPA data for DataOverlay will be provided as the transcriptomics tsv file for Human-GEM, the parsing code in data-generation is no longer needed.

mihai-sysbio commented 3 years ago

Hmm I'm not sure about this one.

New genes might be added to Human-GEM between versions, although this doesn't happen that often. Anyway, I'm wondering whether it would be more responsible of us to do the filtering on demand, at each release, rather than keeping a static dataset that needs to be filtered manually.

nanjiangshu commented 3 years ago

By filtering, do you mean removing the genes that are in the HPA data but do not exist in the Human-GEM model? That is done with this line in the parsing script. I don't think the filtering is a big problem. We update hpaRna.tsv whenever rna_tissue_hpa.tsv or Human-GEM.yml is updated, and I guess that won't happen very often. If we really want to automate it, we could add a version tag and integrate the formatting script into generate-data.
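For readers following the thread, the filtering step under discussion can be sketched roughly as below. This is a minimal Python illustration, not the parsing script from the repository: the function names `filter_hpa_rows` and `to_tsv`, the `Gene` column name, and the gene IDs in the usage note are all assumptions made for the example. It keeps only the HPA rows whose gene is present in the model's gene set.

```python
import csv
import io


def filter_hpa_rows(hpa_tsv_text, model_gene_ids, gene_column="Gene"):
    """Keep only the HPA rows whose gene ID is present in the model.

    hpa_tsv_text   -- contents of a tab-separated HPA file with a header row
    model_gene_ids -- set of gene IDs taken from the model (hypothetical input)
    gene_column    -- name of the gene ID column (assumed, not confirmed)
    """
    reader = csv.DictReader(io.StringIO(hpa_tsv_text), delimiter="\t")
    kept = [row for row in reader if row[gene_column] in model_gene_ids]
    return reader.fieldnames, kept


def to_tsv(fieldnames, rows):
    """Serialize the filtered rows back to tab-separated text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

With placeholder data such as `"Gene\tTissue\tnTPM\nG1\tliver\t1.5\nG2\tliver\t0.0\n"` and a model gene set `{"G1"}`, only the `G1` row survives. Rerunning a step like this whenever either input file changes is what "filtering on demand" would amount to.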

mihai-sysbio commented 3 years ago

Sounds good to me. Could you then please update https://github.com/MetabolicAtlas/data-files/blob/main/DATA_OVERLAY.md with a section detailing this update procedure?

nanjiangshu commented 3 years ago

I've created a new issue for it.