CBIIT / R-cometsAnalytics

R package development for COMETS Analytics

COMETS 1.4. Integrity check fails when comp-id has characters #66

Closed steven-moore closed 6 years ago

steven-moore commented 6 years ago

The new harmonization algorithm checks for chemical-id and comp-id columns and uses them to match against the UID file. However, this assumes that no characters have been added to the comp-id or chemical-id. For those using R for data analysis, it will be relatively common to use the comp-id or chemical-id as metabolite names and to add a character to the front to comply with R's column name requirements (a syntactic R name cannot start with a digit). This returns an error, per the screenshot below. Could we add a step that strips out any non-numeric characters from the comp-id or chemical-id? Sample files for testing are also included below.

[Screenshot of the error]

Scrambled.CPSII.data (2).xlsx

Scrambled.CPSII.data_comp_id_removed.xlsx
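
For illustration only, the stripping step requested above could look roughly like the following. This is a minimal Python sketch, not code from the COMETS package: the helper name `strip_to_digits` and the sample IDs are hypothetical, and it assumes the numeric part of each ID is the only part that matters.

```python
import re

def strip_to_digits(raw_id):
    """Hypothetical helper: keep only the digit characters of an ID."""
    return re.sub(r"\D", "", str(raw_id))

# Made-up IDs; "X35126" mimics a numeric ID with a letter prepended
# so that it can serve as an R column name.
ids = ["X35126", "35126", "M-1110"]
cleaned = [strip_to_digits(i) for i in ids]
```

A caveat with this approach is that it cannot tell an added prefix from a character that was part of the original ID, and any digit inside the added prefix would be merged into the result.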

steven-moore commented 6 years ago

A related issue is that the IMS_UID should be more fully utilized in the harmonization scheme. After IMS harmonizes the metabolites, they will typically provide a file back to each group with a new field indicating the final harmonization. I don't believe that we have previously posted any datasets like this, so I have pasted one below which has the final IMS UID. Please let me know if you have any questions about this.

Scrambled.women.CPSII.data.xlsx

steven-moore commented 6 years ago

We determined that it is too difficult to "guess" how to fix these issues retrospectively. Instead, if COMP-ID or CHEMICAL-ID is used, it must be a number-only field in the exact format received from Metabolon. So we will keep this requirement; the current functionality is correct. We will also have to add instructions to our tutorials and e-mails to notify people. Issue closed.
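
For anyone who wants to check a file before uploading, the number-only requirement can be tested with something like the sketch below. It is a hypothetical Python illustration, not part of the COMETS package: the function name `non_numeric_ids` and the sample values are made up, and it assumes the IDs have already been read in as strings.

```python
import re

def non_numeric_ids(ids):
    """Hypothetical check: return the IDs that are not digits only."""
    return [i for i in ids if re.fullmatch(r"[0-9]+", str(i)) is None]

# Made-up sample values; only the second one has an added character.
sample = ["35126", "X35126", "1110"]
bad = non_numeric_ids(sample)
```

An empty result means every ID is number-only; anything returned would need to be restored to the format received from Metabolon before running the integrity check.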