rmflight closed this issue 5 months ago
Maybe I misunderstand, but this doesn't look like an issue in mwtab. "parse_con" seems to be a function somewhere else. Are you saying that mwtab should not be validating these files? When I read them in using mwtab there are no errors.
You're right @ptth222, it's an R function that doesn't like them. But it seems really, really weird that it can parse the vast majority of the mwtab files, yet these particular ones fail.
Because it doesn't look like a pervasive problem throughout all of the mwtab files from MW, it felt like something had to be off with these particular files.
I can try some of them in another R based parser, and see if I still have issues.
I investigated these some more.

key value

These are strange. Was there any editing before trying to parse? If you manually search for the piece of the file shown in the error, you can't find it. For example, for the first file in this section, 'ditional sample data":{"QC01"} },' is not in that file anywhere. "Additional sample data" is always followed by {"Tissue weight": "..."}. I checked the first 3 files and this was the case for each one. The piece of the file shown in the error isn't in the file. I don't know what's up with these.
key value2

The first one in the list is genuinely bad JSON, I think. Python won't read it either. There are keys like "Obesity (BMI>30:1)" in the "Additional sample data", but the "30:1" part has extra quotation marks around the colon, so it becomes "Obesity (BMI>30":"1)". That makes it fail to parse. If you remove the extra quotation marks it parses fine. I'm not sure how a mistake like that got in there; I guess the Workbench has an issue in how they generate the JSON. They should probably be told that keys with a colon in them are causing issues.
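To illustrate, here is a minimal Python reconstruction of that malformed key. The strings are hand-made examples modeled on the description above, not the actual file contents:

```python
import json

# Hand-made reconstruction of the malformed key described above:
# stray quotes around the colon split "Obesity (BMI>30:1)" into a
# key/value pair of its own, leaving a dangling ':' after it.
malformed = '{"Obesity (BMI>30":"1)": "yes"}'
repaired = '{"Obesity (BMI>30:1)": "yes"}'

def parses(text):
    """Return True if `text` is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(parses(malformed))  # False: parser expects ',' or '}' but sees ':'
print(parses(repaired))   # True: a colon inside a quoted key is legal JSON
```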
The others I checked in this section have the same issue as what I said in the "key value" section. It looks like chunks of the file were deleted or something. The second file has a piece of file in the error:
SIS_TYPE":"MS"}, "MS":"Units":"peak_height", "Data":[{"Met
Here is the section in the file:
"ANALYSIS":{"ANALYSIS_TYPE":"MS"},
"MS":{"INSTRUMENT_NAME":"Orbitrap Fusion","INSTRUMENT_TYPE":"FTMS","MS_TYPE":"ESI","ION_MODE":"POSITIVE","MS_COMMENTS":"Measurements made using direct infusion (nano-electrospray) FTMS."},
"MS_METABOLITE_DATA":{
"Units":"peak_height",
The piece in the error is missing the value for "MS" and the "MS_METABOLITE_DATA" key; what the parser is seeing is:
"ANALYSIS":{"ANALYSIS_TYPE":"MS"},
"MS":
"Units":"peak_height",
Not sure what is going on here.
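For what it's worth, fragments like the ones jsonlite quotes can be reproduced in Python by printing the text around the parser's reported error offset. A minimal sketch, where `broken` is a hand-made reconstruction of the truncated "MS" example above, not the actual file contents:

```python
import json

def json_error_context(text, width=30):
    """Try to parse `text`; on failure, return the snippet of text
    surrounding the position the parser complains about."""
    try:
        json.loads(text)
        return None  # parsed cleanly, nothing to report
    except json.JSONDecodeError as e:
        return text[max(e.pos - width, 0):e.pos + width]

# Hand-made reconstruction: the value of "MS" and the
# "MS_METABOLITE_DATA" key have been dropped, as described above.
broken = '{"ANALYSIS":{"ANALYSIS_TYPE":"MS"}, "MS": "Units":"peak_height"}'
print(json_error_context(broken))
```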
array issue

Same as "key value". This one is especially strange because it looks very out of order. The error piece is:
9S065_ZHP"} }, [], "SUBJECT":], "COLLECTION":{"COLLECTION_S
'"SUBJECT":' only ever appears in one place, which is before SUBJECT_SAMPLE_FACTORS, but the fragment '9S065_ZHP' is a piece of the RAW_FILE_NAME, which is in the "Additional sample data" after the "SUBJECT" section. Long story short, this doesn't look like a simple deletion of a section. One of the changes I made in my branch was for issue #10 (https://github.com/MoseleyBioinformaticsLab/mwtab/issues/10). Basically, the package did not previously ensure the correct ordering of the sections, but I changed that. I'm wondering if that might have something to do with this. It's hard to say without knowing the exact workflow you used here.
lexical

I couldn't even find a piece of the error-message fragment in these files.
Piece from first error:
=C3C(CCC(O)=O)=C(C)C(/C=C(N4)\C(C)=C(C=C)C4=O)=N\3)=C1CCC(O)
I couldn't find any of this in the file. Same for the next 2. They load fine in Python.
invalid character

The first one has a genuinely bad character that isn't known to the encoding. Python can't read it either, with either the default or UTF-8 encoding. There is probably an encoding that would work, but I don't know what it is. The second one I don't know about: Python reads it fine, and I can't see any additional characters manually. It might have gotten injected somewhere.
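One way to pin down exactly where a bad character sits is to read the file as raw bytes and record every offset where strict decoding fails. A rough sketch, assuming the downloaded file is available on disk (the path is illustrative):

```python
# Sketch for locating bytes a given encoding can't decode, assuming the
# downloaded file is read as raw bytes first.
def find_bad_bytes(path, encoding="utf-8"):
    """Return (offset, byte_value) pairs where strict decoding fails."""
    data = open(path, "rb").read()
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as e:
            bad.append((pos + e.start, data[pos + e.start]))
            pos += e.end  # skip just past the offending byte(s)
    return bad

# e.g. find_bad_bytes("AN000001.json") -> [(offset, byte), ...]
```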
token

This seems similar to "key value2". It looks like parts were somehow deleted from the file.
Error piece from first:
ETECTOR_TYPE":"TOF"}, "MS": }
You can see that '"MS":' doesn't have its dictionary after it for some reason.
invalid key

More of the same. It looks like the file got corrupted or changed somehow. The first 5 in particular are weird because the units are completely different: the piece of the error shows "normalized peak area", but the file has "Peak area normalized".
Overall I'm not sure exactly what's going on, but for the majority it looks like the files didn't download correctly or something. I think we would have to investigate the full workflow used to get the errors originally. The issues with colons in keys and bad characters are something the Workbench should probably try to validate and/or fix, though.
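If re-running the survey is feasible, one way to separate "bad download" from "bad source file" would be to re-parse everything fresh off the Workbench and compare the failure lists. A rough sketch in Python (the glob pattern is illustrative):

```python
import glob
import json

def validate_json_files(pattern):
    """Re-parse every file matching `pattern` and group the failures
    by exception type (parse errors vs. encoding errors)."""
    failures = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path, encoding="utf-8") as fh:
                json.load(fh)
        except (json.JSONDecodeError, UnicodeDecodeError) as e:
            failures.setdefault(type(e).__name__, []).append(path)
    return failures

# e.g. validate_json_files("downloads/*.json")
```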
This is a bigger issue if file download integrity is involved.
The workflow was literally having mwtab grab everything on the Workbench:

mwtab download study all --output-format="json" --verbose

And then I started parsing the files for metadata, using the jsonlite package in R, so I could find experiments that were appropriate for the analysis I was trying to do.
We discussed this in a lab meeting and determined that what the Workbench pushes to you is now different from what it was when these files were originally downloaded. That is the source of most of the errors. I don't see anything to change in the package to address any of this. Can we close this issue?
Sure thing.
As of a Metabolomics Workbench pull on 2023-04-24, using the jsonlite R JSON parsing library, I have a bunch of analyses that fail to parse, for different reasons. I provide the links to the analyses and the errors that were generated here, and I've grouped the analyses by the type of failure. Interestingly, there seem to be about 7 different failure modes.
key value
key value2
array issue
lexical
invalid character
token
invalid key