MetabolicAtlas / data-generation

Process the raw data-files for ingestion into the Neo4j database
MIT License
0 stars 0 forks source link

feat: new annotation in tsv format MA-908 #16

Closed nanjiangshu closed 3 years ago

nanjiangshu commented 3 years ago

This pull request implements the parser for the new TSV format for models in connection to the issue MA-908 .

It also made some code refactoring so that index.js is more modulated.

To test the new parser, please check out the branch feat/new-TSV-format for the repo data-files

Need to be mentioned that this pull request depends on https://github.com/MetabolicAtlas/data-files/pull/7

nanjiangshu commented 3 years ago

I have had a first look at this PR. @nanjiangshu Since this is a quite extensive PR in terms of changes made and number of commits, could you please add a summary to describe what the new files should look like? For example: "Prior to this change, the output file would contain the following columns, where as this change adds/removes/changes these columns in these files.".

It is great that you have refactored a lot of stuff out of index.js. I notice that both the index.js and the func.js contain some logic for writing to files. It is not clear to me why not all of the file writing logic is not extracted out of index.js. Overall I think some more distinct separation of concerns could be added. For example I might imagine a quite small index.js accompanied by the following files:

* `parser.js`: a list of functions that are only responsible for parsing data files

* `writer`: a list of functions that are only responsible for taking output of `parser.js` and writing to csv files

* `utils.js`: a list of helper functions like `trim` and `toLabelCase` that are used in other files

Then the index.js could just focus on reading the CLI params and piping the return of parser.js to writer.js.

Hi Shan @e0, thanks very much for your comments. I think it is a very good suggestion that we further split the func.js to several javascript files based on the functionality of modules.

I haven't moved all file processing code out of index.js is because while I was doing that, I realized many of these code may probably never been re-used. I stopped at some point in case some of you may think it is not necessary to move these lines of code to a separate file. I can keep on moving the rest of them out of index.js if you think it is necessary.

The format of the output at the folder ./data is exactly the same as before. What has been changed are mainly

  1. createPMIDFile() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L87

    • Before PMID info is obtained from the file REACTIONS.tsv but now from YMAL file.
  2. extractGeneAnnotation() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L90

    • Before the information is obtained from the file GENES.tsv but now from genes-new.tsv
  3. createComponentExternalDbFile() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L100

    • Before the information is obtained from the files {component}_EID.tsv but now from {component}-news.tsv
e0 commented 3 years ago

The format of the output at the folder ./data is exactly the same as before. What has been changed are mainly

  1. createPMIDFile() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L87

    • Before PMID info is obtained from the file REACTIONS.tsv but now from YMAL file.
  2. extractGeneAnnotation() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L90

    • Before the information is obtained from the file GENES.tsv but now from genes-new.tsv
  3. createComponentExternalDbFile() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L100
  • Before the information is obtained from the files {component}_EID.tsv but now from {component}-news.tsv

Thank you, this is really helpful. I haven't looked through everything yet but I noticed there is a lot more gene annotations compared to before. For example, for HumanGem V1.6.0, previously (using GENES.tsv) we had 14479 lines and now (using genes-new.tsv) there is 60440 lines in the HumanGemV1_6_0.geneExternalDbs.csv file. Is this expected?

I also noticed in line 113 from parser.js the following table column names are expected: const [ geneId, geneENSTID, geneENSPID, geneUniProtID, name, geneEntrezID, alternateName, synonyms] = lines[i].split('\t').map(e => utils.trim(e, '"')); However in the https://github.com/MetabolicAtlas/data-files/blob/feat/new-TSV-format/integrated-models/Human-GEM/genes-new.tsv file, I see these column names: genes geneENSTID geneENSPID geneUniProtID geneSymbols geneEntrezID geneNames geneAliases. Not sure if I'm looking at the wrong place but please let me know if I'm misunderstanding this mismatch in column names.

nanjiangshu commented 3 years ago

Thank you, this is really helpful. I haven't looked through everything yet but I noticed there is a lot more gene annotations compared to before. For example, for HumanGem V1.6.0, previously (using GENES.tsv) we had 14479 lines and now (using genes-new.tsv) there is 60440 lines in the HumanGemV1_6_0.geneExternalDbs.csv file. Is this expected?

This is because many of the entries for geneENSTID and geneENSPID have multiple records and they are written in separate lines in the file HumanGemV1_6_0.geneExternalDbs.csv

e0 commented 3 years ago

Thank you, this is really helpful. I haven't looked through everything yet but I noticed there is a lot more gene annotations compared to before. For example, for HumanGem V1.6.0, previously (using GENES.tsv) we had 14479 lines and now (using genes-new.tsv) there is 60440 lines in the HumanGemV1_6_0.geneExternalDbs.csv file. Is this expected?

This is because many of the entries for geneENSTID and geneENSPID have multiple records and they are written in separate lines in the file HumanGemV1_6_0.geneExternalDbs.csv

Ok, I'm not too familiar with the newer data files but it sounds like it is correct that we are generating a much bigger externalDs.csv file now. Thanks for the explanation.

nanjiangshu commented 3 years ago

I also noticed in line 113 from parser.js the following table column names are expected: const [ geneId, geneENSTID, geneENSPID, geneUniProtID, name, geneEntrezID, alternateName, synonyms] = lines[i].split('\t').map(e => utils.trim(e, '"')); However in the https://github.com/MetabolicAtlas/data-files/blob/feat/new-TSV-format/integrated-models/Human-GEM/genes-new.tsv file, I see these column names: genes geneENSTID geneENSPID geneUniProtID geneSymbols geneEntrezID geneNames geneAliases. Not sure if I'm looking at the wrong place but please let me know if I'm misunderstanding this mismatch in column names.

The variable names name, alternateName and synonyms are used previously and I was trying to preserve the old variable names as much as possible when I change the code. It is probably better to match the variable names with the header names in the file genes-new.tsv. However, the names in the header of genes-new.tsv are not always perfect. For example, the long gene names should be called alternateName.

e0 commented 3 years ago

The variable names name, alternateName and synonyms are used previously and I was trying to preserve the old variable names as much as possible when I change the code. It is probably better to match the variable names with the header names in the file genes-new.tsv. However, the names in the header of genes-new.tsv are not always perfect. For example, the long gene names should be called alternateName.

That makes sense, sorry I should have looked more carefully before posting.