Closed nanjiangshu closed 3 years ago
I have had a first look at this PR. @nanjiangshu Since this is a quite extensive PR in terms of changes made and number of commits, could you please add a summary to describe what the new files should look like? For example: "Prior to this change, the output file would contain the following columns, where as this change adds/removes/changes these columns in these files.".
It is great that you have refactored a lot of stuff out of
index.js
. I notice that both theindex.js
and thefunc.js
contain some logic for writing to files. It is not clear to me why not all of the file writing logic is not extracted out ofindex.js
. Overall I think some more distinct separation of concerns could be added. For example I might imagine a quite smallindex.js
accompanied by the following files:* `parser.js`: a list of functions that are only responsible for parsing data files * `writer`: a list of functions that are only responsible for taking output of `parser.js` and writing to csv files * `utils.js`: a list of helper functions like `trim` and `toLabelCase` that are used in other files
Then the
index.js
could just focus on reading the CLI params and piping the return ofparser.js
towriter.js
.
Hi Shan @e0, thanks very much for your comments. I think it is a very good suggestion that we further split the func.js
to several javascript files based on the functionality of modules.
I haven't moved all file processing code out of index.js
is because while I was doing that, I realized many of these code may probably never been re-used. I stopped at some point in case some of you may think it is not necessary to move these lines of code to a separate file. I can keep on moving the rest of them out of index.js
if you think it is necessary.
The format of the output at the folder ./data
is exactly the same as before. What has been changed are mainly
createPMIDFile() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L87
REACTIONS.tsv
but now from YMAL file.extractGeneAnnotation() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L90
GENES.tsv
but now from genes-new.tsv
createComponentExternalDbFile() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L100
{component}_EID.tsv
but now from {component}-news.tsv
The format of the output at the folder
./data
is exactly the same as before. What has been changed are mainly
createPMIDFile() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L87
- Before PMID info is obtained from the file
REACTIONS.tsv
but now from YMAL file.extractGeneAnnotation() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L90
- Before the information is obtained from the file
GENES.tsv
but now fromgenes-new.tsv
- createComponentExternalDbFile() https://github.com/MetabolicAtlas/neo4j-data-generation/blob/feat/new-TSV-MA-908/index.js#L100
- Before the information is obtained from the files
{component}_EID.tsv
but now from{component}-news.tsv
Thank you, this is really helpful. I haven't looked through everything yet but I noticed there is a lot more gene annotations compared to before. For example, for HumanGem V1.6.0, previously (using GENES.tsv) we had 14479 lines and now (using genes-new.tsv) there is 60440 lines in the HumanGemV1_6_0.geneExternalDbs.csv
file. Is this expected?
I also noticed in line 113 from parser.js
the following table column names are expected:
const [ geneId, geneENSTID, geneENSPID, geneUniProtID, name, geneEntrezID, alternateName, synonyms] = lines[i].split('\t').map(e => utils.trim(e, '"'));
However in the https://github.com/MetabolicAtlas/data-files/blob/feat/new-TSV-format/integrated-models/Human-GEM/genes-new.tsv file, I see these column names:
genes geneENSTID geneENSPID geneUniProtID geneSymbols geneEntrezID geneNames geneAliases
. Not sure if I'm looking at the wrong place but please let me know if I'm misunderstanding this mismatch in column names.
Thank you, this is really helpful. I haven't looked through everything yet but I noticed there is a lot more gene annotations compared to before. For example, for HumanGem V1.6.0, previously (using GENES.tsv) we had 14479 lines and now (using genes-new.tsv) there is 60440 lines in the HumanGemV1_6_0.geneExternalDbs.csv file. Is this expected?
This is because many of the entries for geneENSTID and geneENSPID have multiple records and they are written in separate lines in the file HumanGemV1_6_0.geneExternalDbs.csv
Thank you, this is really helpful. I haven't looked through everything yet but I noticed there is a lot more gene annotations compared to before. For example, for HumanGem V1.6.0, previously (using GENES.tsv) we had 14479 lines and now (using genes-new.tsv) there is 60440 lines in the HumanGemV1_6_0.geneExternalDbs.csv file. Is this expected?
This is because many of the entries for geneENSTID and geneENSPID have multiple records and they are written in separate lines in the file
HumanGemV1_6_0.geneExternalDbs.csv
Ok, I'm not too familiar with the newer data files but it sounds like it is correct that we are generating a much bigger externalDs.csv
file now. Thanks for the explanation.
I also noticed in line 113 from parser.js the following table column names are expected: const [ geneId, geneENSTID, geneENSPID, geneUniProtID, name, geneEntrezID, alternateName, synonyms] = lines[i].split('\t').map(e => utils.trim(e, '"')); However in the https://github.com/MetabolicAtlas/data-files/blob/feat/new-TSV-format/integrated-models/Human-GEM/genes-new.tsv file, I see these column names: genes geneENSTID geneENSPID geneUniProtID geneSymbols geneEntrezID geneNames geneAliases. Not sure if I'm looking at the wrong place but please let me know if I'm misunderstanding this mismatch in column names.
The variable names name
, alternateName
and synonyms
are used previously and I was trying to preserve the old variable names as much as possible when I change the code. It is probably better to match the variable names with the header names in the file genes-new.tsv
. However, the names in the header of genes-new.tsv
are not always perfect. For example, the long gene names should be called alternateName
.
The variable names
name
,alternateName
andsynonyms
are used previously and I was trying to preserve the old variable names as much as possible when I change the code. It is probably better to match the variable names with the header names in the filegenes-new.tsv
. However, the names in the header ofgenes-new.tsv
are not always perfect. For example, the long gene names should be calledalternateName
.
That makes sense, sorry I should have looked more carefully before posting.
This pull request implements the parser for the new TSV format for models in connection to the issue MA-908 .
It also made some code refactoring so that
index.js
is more modulated.To test the new parser, please check out the branch
feat/new-TSV-format
for the repodata-files
Need to be mentioned that this pull request depends on https://github.com/MetabolicAtlas/data-files/pull/7