Open jcmatese opened 9 months ago
Sorry, it appears that it's the tracebase example data that should be updated. I'll re-create this issue there. That said, we should probably document somewhere that actual production data loads should use the consolidated compounds file in the Rabinowitz repo, not the one from the example data. Unless I'm misunderstanding or not precisely aware of the plan for housing this "basal" data? @lparsons - care to weigh in?
Yes, sorry, this was "discovered" because I was using the tracebase CONTRIBUTING doc to cover for the out-of-date tracebase-rabinowitz-data docs/wiki. Not a big deal, just thought I would report it.
Example data in the tracebase repository can be whatever we like. It should be one (or a few) studies that new users and new developers can use to test their installation and use for development, etc. Testing data in tracebase can be broken in various ways, cover edge cases, etc., and is used only for testing.
Data for our production system should be entirely separate, and is currently housed in the tracebase-rabinowitz-data repository (except mzXML files). For that data, all compounds, tissues, instruments, lc_methods, etc. should be kept separate from each individual study and loaded first. See https://github.com/PrincetonUniversity/tracebase-rabinowitz-data/issues/92.
Hopefully this helps clear things up, but let me know if I can help clarify anything else.
Mea culpa. John had noted that the instructions in the admin docs were stale and I pointed him to the CONTRIB doc.
@lparsons - I think this issue can be closed, but I'd like to see what you think. This issue stems from loading example data during the work on loading production data, so there were bound to be consistency issues. If anything, this could be supplanted by a documentation issue (if there's still a problem - I haven't checked).
While the example data and production data are separate, it seems reasonable that we would use the same HMDB ID for the same named compound in both, at least for consistency.
@mneinast Can you help us determine which of the following records is "preferred"? The main thing to consider is whether HMDB0258206 or HMDB0001068 is the preferable HMDB record for this compound.
Production data:
sedoheptulose 7-phosphate C7H15O10P HMDB0258206
It looks like the name may have originated from study obob_fasted_ace_glycerol_3hb_citrate_eaa_fa by Xianfeng Zeng in 2022.
It was transferred to the consolidated list here on Feb 10th of 2023:
One later study had the dash version changed to match the pre-existing study, as documented here:
The name and HMDB ID differ between the consolidated list
https://github.com/Princeton-LSI-ResearchComputing/tracebase/blob/4bb4583c9898868cdb5800b9d0a2dd6bf3228339/DataRepo/data/examples/compounds/consolidated_tracebase_compound_list.tsv#L51
and the Rabinowitz data repo
https://github.com/PrincetonUniversity/tracebase-rabinowitz-data/blob/9d44914f73de1491f2cd44dcf75f115f3ce45762/compounds/compounds.tsv#L164
Also, regarding differences, there are 52 lines in the former and 184 lines in the latter, so I think the former ("consolidated") might just be for example/test data, and is not comprehensive (consolidated, but perhaps in a different context?).
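For what it's worth, a check like the following could flag compounds whose HMDB ID differs between the two lists, even when the names differ only by a dash. This is a minimal sketch, not anything that exists in either repo: the column headers "Compound" and "HMDB ID" are assumptions (I have not confirmed them against either file), the function names are hypothetical, and the in-memory rows are illustrative stand-ins. Only the production record for sedoheptulose 7-phosphate is taken from this thread; the dashed name and the pairing with HMDB0001068 on the example side are guesses.

```python
import csv
import io
import re


def normalize_name(name):
    """Lowercase and treat runs of dashes, underscores, or whitespace as one space."""
    return re.sub(r"[-_\s]+", " ", name.strip().lower())


def read_compounds(handle, name_col="Compound", hmdb_col="HMDB ID"):
    """Map normalized name -> (original name, HMDB ID) from a TSV stream.

    The default column names are assumptions; pass the real headers if they differ.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        normalize_name(row[name_col]): (row[name_col], row[hmdb_col].strip())
        for row in reader
    }


def hmdb_mismatches(example, production):
    """List shared compounds (by normalized name) whose HMDB IDs differ."""
    out = []
    for key in sorted(example.keys() & production.keys()):
        ex_name, ex_id = example[key]
        prod_name, prod_id = production[key]
        if ex_id != prod_id:
            out.append((ex_name, ex_id, prod_name, prod_id))
    return out


# Illustrative stand-ins for the two TSV files (not copied from either repo).
example_tsv = (
    "Compound\tFormula\tHMDB ID\n"
    "sedoheptulose-7-phosphate\tC7H15O10P\tHMDB0001068\n"
    "glucose\tC6H12O6\tHMDB0000122\n"
)
production_tsv = (
    "Compound\tFormula\tHMDB ID\n"
    "sedoheptulose 7-phosphate\tC7H15O10P\tHMDB0258206\n"
    "glucose\tC6H12O6\tHMDB0000122\n"
)

example = read_compounds(io.StringIO(example_tsv))
production = read_compounds(io.StringIO(production_tsv))
mismatches = hmdb_mismatches(example, production)
for ex_name, ex_id, prod_name, prod_id in mismatches:
    print(f"{ex_name} ({ex_id}) vs {prod_name} ({prod_id})")
```

To run it against the real files, open each path with `open(path, newline="")` and pass the handle to `read_compounds` in place of the `io.StringIO` objects. Matching on a normalized name is a design choice for this sketch; matching on formula or synonyms might be more robust, depending on what the files actually contain.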