MoseleyBioinformaticsLab / mwtab

The mwtab package is a Python library that facilitates reading and writing files in mwTab format used by the Metabolomics Workbench for archival of Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR) experimental data.
http://mwtab.readthedocs.io
BSD 3-Clause Clear License

various failures of json parsing #9

Closed rmflight closed 5 months ago

rmflight commented 1 year ago

As of a Metabolomics Workbench pull on 2023-04-24, using the jsonlite R JSON parsing library, I have a bunch of analyses that fail to parse, for different reasons.

I provide the links to the analyses, and the error that is generated here, and I've grouped the analyses by the type of failure. Interestingly, there seem to be about 7 different failure modes.

key value

Error in parse_con(txt, bigint_as_char) : 
  parse error: object key and value must be separated by a colon (':')
          ditional sample data":{"QC01"} }, { "Subject ID":"-", "Sampl
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          8","JWN ID":"LNG-18","LNG-19","File name (Oxylipin data)":"L
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          8","JWN ID":"LNG-18","LNG-19","File name (Oxylipin data)":"L
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          8","JWN ID":"LNG-18","LNG-19","File name (Oxylipin data)":"L
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          l sample data":{"Tissue""Gast""L. medial superficial","ID""M
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          FM  cold 24-120h","6-23-2012"} }, { "Subject ID":"-", "Sampl
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          mMGTP1713_A_0", "Factors":{""}, "Additional sample data":{"t
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
           osteoarthritis, tobacco use"} }, { "Subject ID":"GSU019", "
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          JECT_SAMPLE_FACTORS:        "} }, { "Subject ID":"-", "Sampl
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          W_FILE_NAME":"1_neg","1_neg2","1_Pos","1_Pos2"} }, { "Subjec
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          W_FILE_NAME":"1_neg","1_neg2","1_Pos","1_Pos2"} }, { "Subjec
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          ndian/Alaskan Native","White","RAW_FILE_NAME":"0232_WUG_FARM
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          ndian/Alaskan Native","White","RAW_FILE_NAME":"0232_WUG_FARM
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          ndian/Alaskan Native","White","RAW_FILE_NAME":"0232_WUG_FARM
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          ndian/Alaskan Native","White","RAW_FILE_NAME":"0232_WUG_FARM
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          nt, effective sac -&gt","d19","Plasma.Volume.Gross":"650","H
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          ":"Bov2_1_R001","Bov2_1_R002","Bov2_1_R003","RAW_FILE_NAME":
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          ":"Bov2_1_R001","Bov2_1_R002","Bov2_1_R003","RAW_FILE_NAME":
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          ":"Bov2_1_R001","Bov2_1_R002","Bov2_1_R003","RAW_FILE_NAME":
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          NAME":"PLD3.scan","PLD3.wiff","Analysed data":"PLD3 lipidomi
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          NAME":"PLD3.scan","PLD3.wiff","Analysed data":"PLD3 lipidomi
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          0_cation/1.d","230_anion/1.d"} }, { "Subject ID":"-", "Sampl
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          0_cation/1.d","230_anion/1.d"} }, { "Subject ID":"-", "Sampl
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: object key and value must be separated by a colon (':')
          E":"AS_01.mzML","AS_001.mzML"} }, { "Subject ID":"01-01-017"
                     (right here) ------^

key value2

Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
           data":{"Obesity (BMI>30":"1)":"0","Ischemia (1)":"1","Age":
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
          SIS_TYPE":"MS"},  "MS":"Units":"peak_height",  "Data":[{"Met
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
          SIS_TYPE":"MS"},  "MS":"Units":"abundance & normalized peak 
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
          SIS_TYPE":"MS"},  "MS":"Units":"peak_height",  "Data":[{"Met
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
          SIS_TYPE":"MS"},  "MS":"Units":"natural abundance corrected 
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
          SIS_TYPE":"MS"},  "MS":"Units":"peak_height",  "Data":[{"Met
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
          SIS_TYPE":"MS"},  "MS":"Units":"natural abundance corrected 
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
          SIS_TYPE":"MS"},  "MS":"Units":"peak_height",  "Data":[{"Met
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
          {"characteristics":"treatment":"untreated"}, "Additional sam
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
          {"characteristics":"treatment":"untreated"}, "Additional sam
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
          ors":{"Caco2":"HT29-MTX_ratio":"100;0","Ionisation_mode":"Ne
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: after key and value, inside map, I expect ',' or '}'
          ors":{"Caco2":"HT29-MTX_ratio":"100;0","Ionisation_mode":"Ne
                     (right here) ------^

array issue

Error in parse_con(txt, bigint_as_char) :
  parse error: after array element, I expect ',' or ']'
          9S065_ZHP"} }, [],  "SUBJECT":], "COLLECTION":{"COLLECTION_S
                     (right here) ------^

lexical

Error in parse_con(txt, bigint_as_char) :
  lexical error: inside a string, '\' occurs before a character which it may not.
          =C3C(CCC(O)=O)=C(C)C(/C=C(N4)\C(C)=C(C=C)C4=O)=N\3)=C1CCC(O)
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  lexical error: inside a string, '\' occurs before a character which it may not.
          A-N","SMILES":"CCCCCC(C(C/C=C\CCCCCCCC(O)=O)O)O","CAS":"2633
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  lexical error: inside a string, '\' occurs before a character which it may not.
          :"O=S(/N=C(CCSCC1=CSC(/N=C(N)\N)=N1)/N)(N)=O","CAS":"76824-3
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  lexical error: inside a string, '\' occurs before a character which it may not.
          =C3C(CCC(O)=O)=C(C)C(/C=C(N4)\C(C)=C(C=C)C4=O)=N\3)=C1CCC(O)
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  lexical error: inside a string, '\' occurs before a character which it may not.
          A-N","SMILES":"CCCCCC(C(C/C=C\CCCCCCCC(O)=O)O)O","CAS":"2633
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  lexical error: inside a string, '\' occurs before a character which it may not.
          WBI","SMILES":"CCCCCC(C(C/C=C\CCCCCCCC(O)=O)O)O"},{"Metaboli
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  lexical error: inside a string, '\' occurs before a character which it may not.
          la":"C4H4O4","SMILES":"OC(=O)\C=C\C(O)=O"},{"Metabolite":"UD
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  lexical error: inside a string, '\' occurs before a character which it may not.
          Metabolite":"PI(15-MHDA_20:4)\PI(17:0_20:4)","37894_H":"111"
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  lexical error: inside a string, '\' occurs before a character which it may not.
          TLIN":"4198","SMILES":"OC(=O)\C=C/C(O)=O","Comment":"1"},{"M
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  lexical error: inside a string, '\' occurs before a character which it may not.
          ZSA-N","SMILES":"CCCCCCCC/C=C\CCCC(=O)O"},{"Metabolite":"FA 
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  lexical error: inside a string, '\' occurs before a character which it may not.
          TLIN":"NA","SMILES":"CCCCCCCC\C=C/C\C=C/C\C=C/CCCC(O)=O"},{"
                     (right here) ------^

invalid character

Error in parse_con(txt, bigint_as_char) :
  lexical error: invalid character inside string.
          3":"53122664"},{"Metabolite":"uorene","4856":"1911632","823
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  lexical error: invalid character inside string.
          bundance corrected percentages                ",  "Data":[{"Metabolite":
                     (right here) ------^

token

Error in parse_con(txt, bigint_as_char) :
  parse error: unallowed token at this point in JSON text
          ETECTOR_TYPE":"TOF"},  "MS": }  
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: unallowed token at this point in JSON text
          ":[],  "ANALYSIS":[],  "MS": }  
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: unallowed token at this point in JSON text
          MATOGRAPHY":[],  "ANALYSIS": }  { "METABOLOMICS WORKBENCH":{
                     (right here) ------^

invalid key

Error in parse_con(txt, bigint_as_char) :
  parse error: invalid object key (must be a string)
          its":"normalized peak area", [],  "EXTENDED_METABOLITE_DATA"
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: invalid object key (must be a string)
          its":"normalized peak area", [],  "EXTENDED_METABOLITE_DATA"
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: invalid object key (must be a string)
          its":"normalized peak area", [],  "EXTENDED_METABOLITE_DATA"
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: invalid object key (must be a string)
          its":"normalized peak area", [],  "EXTENDED_METABOLITE_DATA"
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: invalid object key (must be a string)
          its":"normalized peak area", [],  "EXTENDED_METABOLITE_DATA"
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: invalid object key (must be a string)
          R_A":"16.36836876784804"}],  }  
                     (right here) ------^
Error in parse_con(txt, bigint_as_char) :
  parse error: invalid object key (must be a string)
          ","60_060":"318.4301983"}],  }  
                     (right here) ------^
ptth222 commented 6 months ago

Maybe I misunderstand, but this doesn't look like an issue in mwtab. "parse_con" seems to be a function somewhere else. Are you saying that mwtab should not be validating these files? When I read them in using mwtab there are no errors.

rmflight commented 6 months ago

You're right @ptth222, it's an R function that doesn't like them. But it seems really, really weird that it can parse a whole lot of the mwtab files, but these particular ones fail.

Because it doesn't feel like a pervasive problem throughout all of the mwtab files from MW, it felt like it had to be something off with these particular files.

I can try some of them in another R based parser, and see if I still have issues.

ptth222 commented 6 months ago

I investigated these some more.

key value

These are strange. Was there any editing before trying to parse? If you manually search for the piece of the file that is shown in the error you can't find it. For example, for the first file in this section 'ditional sample data":{"QC01"} },' is not in that file anywhere. "Additional sample data" is always followed with {"Tissue weight": "..."}. I checked the first 3 files and this was the case for each one. The piece of the file shown in the error isn't in the file. I don't know what's up with these.
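The manual check described here, searching a downloaded file for the fragment quoted in the parser error, can be sketched in Python. This is only a sketch and not part of mwtab: `fragment_in_text` and `fragment_in_file` are hypothetical helper names, and whitespace is stripped from both sides on the assumption that the error context and the file differ at most in spacing.

```python
import re
from pathlib import Path


def normalize(text: str) -> str:
    """Remove all whitespace so spacing differences do not hide a match."""
    return re.sub(r"\s+", "", text)


def fragment_in_text(fragment: str, text: str) -> bool:
    """Return True if the error fragment occurs in the text, ignoring whitespace."""
    return normalize(fragment) in normalize(text)


def fragment_in_file(fragment: str, path: str) -> bool:
    """Hypothetical helper: run the same check against a file on disk."""
    content = Path(path).read_text(encoding="utf-8", errors="replace")
    return fragment_in_text(fragment, content)
```

A `False` result for a fragment copied from the error output would match what is described above: the text the parser complained about is not in the file as it exists now.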

key value2

The first one in the list is genuinely bad JSON, I think. Python won't read it either. There are keys like "Obesity (BMI>30:1)" in the "Additional sample data", but the "30:1" part has extra quotation marks around the colon, so "Obesity (BMI>30":"1)". This makes it not parse right. If you remove the extra quotation marks it parses fine. I'm not sure how a mistake like that got in there. I guess the Workbench has an issue in how they are generating the JSON. They should probably be told that keys with a colon in them are causing issues.
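The effect of the extra quotation marks can be reproduced with Python's standard `json` module. The strings below are shortened reconstructions of the fragment quoted in the error, closed off with a brace so they form a complete object; they are not copied from the actual file.

```python
import json

# Reconstructed from the error output: the key is split at the ':' character
# by the extra quotation marks.
broken = '{"Obesity (BMI>30":"1)":"0","Ischemia (1)":"1"}'
# The same fragment with the extra quotation marks removed.
fixed = '{"Obesity (BMI>30:1)":"0","Ischemia (1)":"1"}'


def parses(text: str) -> bool:
    """Return True if the standard-library JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(broken))  # False
print(parses(fixed))   # True
```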

The others I checked in this section have the same issue as what I said in the "key value" section. It looks like chunks of the file were deleted or something. The error for the second file shows this piece of the file:

SIS_TYPE":"MS"}, "MS":"Units":"peak_height", "Data":[{"Met

Here is the section in the file:

"ANALYSIS":{"ANALYSIS_TYPE":"MS"},

"MS":{"INSTRUMENT_NAME":"Orbitrap Fusion","INSTRUMENT_TYPE":"FTMS","MS_TYPE":"ESI","ION_MODE":"POSITIVE","MS_COMMENTS":"Measurements made using direct infusion (nano-electrospray) FTMS."},

"MS_METABOLITE_DATA":{
"Units":"peak_height",

The piece in the error is missing the value for "MS" and the "MS_METABOLITE_DATA" key. What the parser is seeing is:

"ANALYSIS":{"ANALYSIS_TYPE":"MS"},

"MS":
"Units":"peak_height",

Not sure what is going on here.

array issue

Same as "key value". This one is especially strange because it looks very out of order. The error piece is:

9S065_ZHP"} }, [], "SUBJECT":], "COLLECTION":{"COLLECTION_S

'"SUBJECT": ' only ever appears in one place, which is before SUBJECT_SAMPLE_FACTORS, but the fragment '9S065_ZHP' is a piece of the RAW_FILE_NAME which is in the "Additional sample data" after the "SUBJECT" section. Long story short, this doesn't look like a simple deletion of a section. One of the changes I made in my branch was for issue #10. Basically, the package did not previously ensure the correct ordering of the sections, but I changed that. I'm wondering if that might have something to do with this. It's hard to say without knowing the exact workflow you used here.

lexical

I couldn't even find a piece of the error message in these files. Piece from first error: =C3C(CCC(O)=O)=C(C)C(/C=C(N4)\C(C)=C(C=C)C4=O)=N\3)=C1CCC(O)

I couldn't find any of this in the file. Same for the next 2. They load fine in Python.
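For reference, the lexical error reported in this group is the kind a strict JSON parser raises when a backslash inside a string is not followed by a valid escape character, which is what a SMILES string would look like if written out unescaped. A minimal sketch with Python's standard `json` module, using one SMILES value from the error output as illustrative input (the surrounding object is made up):

```python
import json

# A SMILES value from the error output placed in JSON text with bare backslashes.
# The raw-string prefix keeps the backslashes literal in this Python source.
unescaped = r'{"SMILES":"OC(=O)\C=C\C(O)=O"}'
# The same value with each backslash doubled, as JSON requires.
escaped = r'{"SMILES":"OC(=O)\\C=C\\C(O)=O"}'

try:
    json.loads(unescaped)
    unescaped_ok = True
except json.JSONDecodeError:
    unescaped_ok = False  # "\C" is not a valid JSON escape sequence

# The escaped form parses, and the decoded value has single backslashes again.
smiles = json.loads(escaped)["SMILES"]
```

Writing the value back out with `json.dumps` doubles the backslashes again, so a round trip through a conforming serializer keeps the text valid.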

invalid character

The first one is a genuine bad character that isn't known to the encoding. Python can't read it either with default or UTF-8 encoding. There is probably an encoding that would read it, but I don't know what it is. For the second, I don't know. Python reads it fine and I can't see any additional characters manually. It might have gotten injected somewhere.
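One way to pin down such a character is to decode the raw bytes and report where decoding fails. The sketch below assumes UTF-8 as the target encoding; `find_bad_bytes` is a hypothetical helper, and the byte used in the sample is made up, since the thread does not say which byte is actually in the file.

```python
def find_bad_bytes(data: bytes, encoding: str = "utf-8"):
    """Return (offset, offending bytes) for the first undecodable span, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start, data[err.start:err.end]
    return None


# Made-up sample: 0xFB is not a valid UTF-8 start byte.
sample = b'{"Metabolite":"\xfbuorene"}'
location = find_bad_bytes(sample)
```

Decoding the same bytes as `latin-1` never raises, because that encoding maps every byte value to a character, but succeeding that way does not show it is the encoding the file was written in.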

token

This seems similar to "key value2". It looks like parts were somehow deleted out of the file. Error piece from first: ETECTOR_TYPE":"TOF"}, "MS": }

You can see that '"MS":' doesn't have its dictionary after it for some reason.

invalid key

More of the same. Looks like the file got corrupted or changed or something. The first 5 in particular are weird because the units are completely different. The piece of the error shows "normalized peak area", but the file has "Peak area normalized".

Overall I'm not sure exactly what's going on, but for the majority it looks like the file didn't download correctly or something. I think we would have to investigate the full workflow used to get the errors originally. The issues with colons in the keys and bad characters are something the Workbench should probably try to validate and/or fix though.

hunter-moseley commented 6 months ago

This is a bigger issue if file download integrity is involved.

rmflight commented 6 months ago

The workflow was literally having mwtab grab everything on workbench:

mwtab download study all --output-format="json" --verbose

I then parsed the files for metadata with the jsonlite package in R, so I could find experiments that were appropriate for the analysis I was trying to do.
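A rough Python counterpart to that parsing step, useful for checking whether a second parser rejects the same files, might look like the sketch below. It uses Python's standard `json` module rather than jsonlite, so the messages will not match the R output word for word. `group_parse_failures` is a hypothetical helper, and the assumption that the download leaves `*.json` files in a single directory is mine, not something stated in the thread.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_parse_failures(directory: str) -> dict:
    """Map each parser message to the list of file names that triggered it."""
    failures = defaultdict(list)
    for path in sorted(Path(directory).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as err:
            # err.msg is the message without the line/column suffix,
            # so files with the same kind of failure end up in one group.
            failures[err.msg].append(path.name)
        except UnicodeDecodeError:
            failures["not valid UTF-8"].append(path.name)
    return dict(failures)
```

Grouping on the bare message gives roughly the same kind of breakdown as the failure categories listed at the top of the issue.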

ptth222 commented 5 months ago

We discussed this in a lab meeting and determined that what the Workbench pushes to you is now different than when these were originally downloaded. That is the source of most of the errors. I don't see anything to change in the package to address any of this. Can we close this issue?

rmflight commented 5 months ago

Sure thing.