UniversalDataTool / universal-data-tool

Collaborate & label any type of data, images, text, or documents, in an easy web interface or desktop app.
https://universaldatatool.com
MIT License

feature: import of NER JSON #32

Open DSLituiev opened 4 years ago

DSLituiev commented 4 years ago

Hi there. Thank you for a great tool. I am curious whether you are considering support for importing pre-annotated text for NER? This is a very common task in an active-learning setup or as a post-regex clean-up step.

seveibar commented 4 years ago

@DSLituiev for sure, thanks for creating a feature request for it.

Right now, if you want to import NER you have to construct a JSON file in this JSON format.

We would build an import button for whatever file format you're using if it's a common standard. What do you currently use to store NER?

DSLituiev commented 4 years ago

I currently store NER as JSONL (one entry per line) in one of the following formats (using your example):

{ "title": "document123", "document": "This strainer makes a great hat, I'll wear it while I serve spaghetti!", "entities": [ { "label": "hat", "start": 5, "end": 13 }, { "label": "food", "start": 60, "end": 69 } ] }

Or

{ "title": "document123",   "document": "This strainer makes a great hat, I'll wear it while I serve spaghetti!",   "entities": [[5,13,"hat"],  [60, 69, "food"]] }

I see you require a schema, which is fair. I would prefer it still to be in JSONL format, with "header" / first line representing schema. But I'll be happy with any import functionality.
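The two entity layouts above carry the same information, so an importer could accept both. A minimal sketch in Python, assuming the triple form is always ordered `[start, end, label]` as in the example; the helper name `normalize_entities` is hypothetical and not part of the tool:

```python
import json


def normalize_entities(entities):
    """Return entities as a list of {"label", "start", "end"} dicts.

    Accepts either dicts (first format) or [start, end, label] triples
    (second format).
    """
    normalized = []
    for ent in entities:
        if isinstance(ent, dict):
            normalized.append(
                {"label": ent["label"], "start": ent["start"], "end": ent["end"]}
            )
        else:
            # Assumption: triples are ordered [start, end, label].
            start, end, label = ent
            normalized.append({"label": label, "start": start, "end": end})
    return normalized


# One JSONL line in the triple format.
line = '{"title": "document123", "document": "This strainer makes a great hat", "entities": [[5, 13, "hat"]]}'
record = json.loads(line)
record["entities"] = normalize_entities(record["entities"])
print(record["entities"])
```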

seveibar commented 4 years ago

We're working on a JSONL version of the udt format. Right now, the CSV format is very similar to the JSONL you've written. We don't want to create a format that decouples the interface data from the sample data, but I understand that this is sometimes useful.

I think a CSV import will make this fairly easy to do and will work across all datatypes that we currently support. In a CSV import, the interface data can be ignored. An example import would look like the following...

myimport.udt.csv

| path | document | output.entities |
| --- | --- | --- |
| samples.0 | This strainer makes a great... | [ { "label": "hat", "start": ... } ] |
| samples.1 | Boy spaghetti is sure tasty... | [ { "label": "food", "start": ... } ] |

The *.udt.csv format is fairly flexible with column labels, so this would also be acceptable...

myimport.udt.csv

| path | document | output |
| --- | --- | --- |
| samples.0 | This strainer makes a great... | { "entities": [ { "label": "hat", "start": ... } ]} |
| samples.1 | Boy spaghetti is sure tasty... | {"entities": [ { "label": "food", "start": ... } ]} |

The reason to prefer CSV over JSONL for the moment (and the difficulty in general with JSONL) is that interface data is easily included with the CSV format, e.g.:

| path | . | document | output |
| --- | --- | --- | --- |
| interface | { .... } | | |
| samples.0 | | This strainer makes a great... | { "entities": [ { "label": "hat", "start": ... } ]} |
| samples.1 | | Boy spaghetti is sure tasty... | {"entities": [ { "label": "food", "start": ... } ]} |

DSLituiev commented 4 years ago

Glad to hear. I am not sure I grasp what you mean by "interface" when referring to data -- is it metadata / schema / label categories?

Well, IMO it is not any easier than having a JSONL with first line being the schema, but up to you.

DSLituiev commented 4 years ago

I gave it a try. My impression: formatting data into udt.csv is hard. One needs to write tons of custom formatting pieces, come up with quotations to escape json quotations etc etc, given hybrid nature of this format (csv+json).

this example fails for no obvious reason:

``` path,.,document,output interface,'[{"id":"diseases","displayName":"disease"},{"id":"hx_diseases","displayName":"history of disease"},{"id":"neg_diseases","displayName":"negated disease"},{"id":"medications","displayName":"medication"},{"id":"hx_medications","displayName":"history of medication"},{"id":"neg_medications","displayName":"negated medication"},{"id":"procedures","displayName":"procedure"},{"id":"hx_procedures","displayName":"history of procedure"},{"id":"neg_procedures","displayName":"negated procedure"},{"id":"symptoms","displayName":"symptom"},{"id":"hx_symptoms","displayName":"history of symptom"},{"id":"neg_symptoms","displayName":"negated symptom"}]',, 'admission-note-for-abdominal-pain.txt',,'Admission Note for Abdominal Pain\nDATE: ................\n\nCHIEF COMPLAINT: Abdominal pain x ......... hours/days/months\n\nHISTORY OF PRESENT ILLNESS:\n\nSite -\nOnset -\nCharacter -\nRadiation -\nAlleviating factors -\nTime course -\nExacerbating factors -\nSeverity -\nSimilar pain before -\nNausea -\nVomiting -\nDiarrhea -\nConstipation -\nLoss of appetite -\nBlack/bloody stools -\nSick contacts -, Suspicious food consumed -\nFever/chills -, SOB -, Chest pain -, Headache -\nDysuria -\n\nER Tx given -\n\nPAST MEDICAL HISTORY: (circle all that apply)\nPUD Gallstones Kidney stones UTIs MI CAD HTN DM\nStroke CA PVD DVT COPD Asthma\nEGD -\nColonoscopy -\n\nPAST SURGICAL HISTORY: (circle all that apply)\nCholecystectomy Hernia Appendectomy Hysterectomy\n\nMEDICATIONS:\n\nALLERGY: NKDA\n\nFMH: (circle all that apply)\nCAD 55 yo DM Stroke HTN CA\n\nSOCIAL HISTORY: (circle all that apply)\nIndependent NH Lives w spouse son daughter\nAlcohol - no heavy occasional last drink\nSmoker - no\nIllicit drugs - no cocaine heroin marijuana\n\nREVIEW OF SYSTEMS: unremarkable apart from above symptoms\n\nPHYSICAL EXAM:\nVITALS: Orthostatics -\nSpO2 - Initial vitals -\n\nGENERAL APPEARANCE: WD/WN in NAD\nSKIN: no rash\nHEENT: NC/AT, PERRLA (B), moist MM, no 
epistaxis\nNECK: Supple, no JVD +JVD\nLUNGS: CTA (B) crackles L R B wheezing\nHEART: Clear S1S2, RRR irregular murmur S D /6 S3\nABDOMEN: Soft, NT, ND, +BS\nRectal exam:\nEXTREMITIES: no edema +edema\nPERIPHERAL VASCULAR: palpable nonpalpable Doppler\nNEURO:\nAAO x 3, CN 2-12: non focal\nMUSCLE STRENGHT: 5/5 (B), SENSATION: nonfocal\nDTR: ++, CEREBELLAR: non focal\n\nLABS:\n\nN= B= L= AG= LFT\nAmylase , Lipase\nCardiac enzymes x 1 - negative , UA:\nBlood cx:\nCXR:\nKUB:\nEKG:\n\nASSESSMENT:\n- Abdominal pain due to\n*Gastroenteritis\n*Gastritis\n*PUD\n*Pancreatitis\n*Cholecystitis\n*Diverticulitis\n*UTI\n\nPLAN:\n- NPO apart from meds\n- IVF, D5 1/2 NS at 125 cc/hr x 2 L\n- EKG in AM\n- Urine C+S\n- Morphine 2 mg IV q 2-4 hr PRN pain\n- Liver/gallbladder U/S\n- CT abdomen (with or without PO and IV contrast)\n- GI consult\n- CBCD, CMP in AM\n\nSignature:\n\n\nPublished: 02/12/2005\nUpdated: 03/08/2009\n','{"entities": [{"start": 19, "end": 33, "label": "symptoms"}, {"start": 64, "end": 73, "label": "symptoms"}, {"start": 75, "end": 89, "label": "symptoms"}, {"start": 140, "end": 147, "label": "hx_symptoms"}, {"start": 177, "end": 186, "label": "procedures"}, {"start": 267, "end": 271, "label": "symptoms"}, {"start": 281, "end": 287, "label": "symptoms"}, {"start": 290, "end": 298, "label": "symptoms"}, {"start": 301, "end": 309, "label": "symptoms"}, {"start": 312, "end": 324, "label": "symptoms"}, {"start": 327, "end": 343, "label": "symptoms"}, {"start": 346, "end": 365, "label": "symptoms"}, {"start": 368, "end": 372, "label": "symptoms"}, {"start": 412, "end": 424, "label": "symptoms"}, {"start": 428, "end": 431, "label": "symptoms"}, {"start": 435, "end": 445, "label": "symptoms"}, {"start": 449, "end": 457, "label": "symptoms"}, {"start": 460, "end": 467, "label": "symptoms"}, {"start": 536, "end": 546, "label": "diseases"}, {"start": 547, "end": 560, "label": "diseases"}, {"start": 561, "end": 565, "label": "diseases"}, {"start": 569, "end": 572, "label": 
"diseases"}, {"start": 573, "end": 576, "label": "diseases"}, {"start": 580, "end": 586, "label": "diseases"}, {"start": 590, "end": 593, "label": "diseases"}, {"start": 594, "end": 597, "label": "diseases"}, {"start": 598, "end": 602, "label": "diseases"}, {"start": 603, "end": 609, "label": "diseases"}, {"start": 610, "end": 613, "label": "procedures"}, {"start": 616, "end": 627, "label": "procedures"}, {"start": 678, "end": 693, "label": "procedures"}, {"start": 694, "end": 700, "label": "diseases"}, {"start": 701, "end": 713, "label": "procedures"}, {"start": 714, "end": 726, "label": "procedures"}, {"start": 742, "end": 749, "label": "symptoms"}, {"start": 786, "end": 789, "label": "diseases"}, {"start": 799, "end": 805, "label": "diseases"}, {"start": 806, "end": 809, "label": "diseases"}, {"start": 814, "end": 828, "label": "symptoms"}, {"start": 854, "end": 865, "label": "symptoms"}, {"start": 897, "end": 904, "label": "medications"}, {"start": 938, "end": 944, "label": "symptoms"}, {"start": 950, "end": 963, "label": "medications"}, {"start": 969, "end": 976, "label": "medications"}, {"start": 977, "end": 983, "label": "neg_medications"}, {"start": 984, "end": 993, "label": "neg_medications"}, {"start": 1146, "end": 1149, "label": "medications"}, {"start": 1159, "end": 1163, "label": "neg_symptoms"}, {"start": 1203, "end": 1212, "label": "neg_symptoms"}, {"start": 1230, "end": 1233, "label": "neg_symptoms"}, {"start": 1235, "end": 1238, "label": "symptoms"}, {"start": 1254, "end": 1262, "label": "symptoms"}, {"start": 1269, "end": 1277, "label": "symptoms"}, {"start": 1311, "end": 1317, "label": "symptoms"}, {"start": 1355, "end": 1361, "label": "medications"}, {"start": 1384, "end": 1389, "label": "neg_symptoms"}, {"start": 1391, "end": 1396, "label": "neg_symptoms"}, {"start": 1439, "end": 1446, "label": "procedures"}, {"start": 1508, "end": 1517, "label": "symptoms"}, {"start": 1560, "end": 1564, "label": "symptoms"}, {"start": 1580, "end": 1583, 
"label": "procedures"}, {"start": 1584, "end": 1591, "label": "medications"}, {"start": 1594, "end": 1600, "label": "medications"}, {"start": 1601, "end": 1616, "label": "medications"}, {"start": 1648, "end": 1651, "label": "procedures"}, {"start": 1658, "end": 1661, "label": "procedures"}, {"start": 1678, "end": 1692, "label": "symptoms"}, {"start": 1701, "end": 1716, "label": "diseases"}, {"start": 1718, "end": 1727, "label": "diseases"}, {"start": 1734, "end": 1746, "label": "diseases"}, {"start": 1748, "end": 1761, "label": "diseases"}, {"start": 1763, "end": 1777, "label": "diseases"}, {"start": 1779, "end": 1782, "label": "diseases"}, {"start": 1784, "end": 1788, "label": "diseases"}, {"start": 1792, "end": 1795, "label": "procedures"}, {"start": 1814, "end": 1817, "label": "procedures"}, {"start": 1850, "end": 1853, "label": "procedures"}, {"start": 1874, "end": 1882, "label": "medications"}, {"start": 1904, "end": 1908, "label": "symptoms"}, {"start": 1935, "end": 1945, "label": "procedures"}, {"start": 1973, "end": 1981, "label": "medications"}, {"start": 2004, "end": 2007, "label": "medications"}]}' ```
Error: 
JSON Error: SyntaxError: Unexpected token p in JSON at position 0
CSV Error: TypeError: Cannot read property 'trim' of undefined
DSLituiev commented 4 years ago

I am not clear how text and labels are linked per this document. Is it just sequence order? What if some documents have no annotations?

seveibar commented 4 years ago

Thanks @DSLituiev, and sorry for the delay in answering. It's really important that the format is as easy to use as possible.

I'm taking a look at the details you've posted to understand where the confusion is. There is an update coming to the format that alleviates the need for embedded JSON for most things except the interface.

seveibar commented 4 years ago

I believe the issue is that the strings are delimited with apostrophes instead of double quotes. Our CSV parser is probably trying to be compliant with RFC 4180 (check out section 2.7 to see how to embed quotes; most libraries take care of this for you). That said, being easy to use is an extremely high priority, so if possible I'll adjust the CSV parsing library to handle apostrophes. I will also clarify our CSV standard.

Edit: I was wrong about this. The path variable is what is confusing it. I'm changing the error message so that it makes more sense in the future. I'll post an update soon.

Regarding annotations. Yes it is currently sequence order. This is my least favorite part of the format. I think it should probably be more like this:

{
  "interface": { /* ... */ },
  "samples": Array<{
      /* document, imageUrl etc. */
     "output": {
      /* entities etc. */
     }
   }>
}

Currently if a sample does not have annotations, it is represented by null. If a sample has been annotated to be empty, it has an empty array in entities.

How do you feel about that revised format?
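To make the null-versus-empty distinction concrete, here is a small Python sketch against the proposed sample shape. The function name `annotation_status` is hypothetical, and treating a missing `entities` key the same as an empty list is an assumption, not something the format specifies:

```python
def annotation_status(sample):
    """Classify a sample as unannotated, annotated-empty, or annotated."""
    output = sample.get("output")
    if output is None:
        # No annotation pass has happened yet (represented by null).
        return "unannotated"
    # Assumption: a missing "entities" key is treated like an empty list.
    if not output.get("entities"):
        # Someone reviewed the sample and found nothing to label.
        return "annotated-empty"
    return "annotated"


samples = [
    {"document": "Not looked at yet", "output": None},
    {"document": "Nothing to label here", "output": {"entities": []}},
    {"document": "A great hat", "output": {"entities": [{"label": "hat", "start": 8, "end": 11}]}},
]
print([annotation_status(s) for s in samples])
```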

DSLituiev commented 4 years ago

I would rename "output" to "labels", like doccano uses. It sounds more intuitive, especially when "output" is in the input.

On Sat, Apr 11, 2020 at 12:38 PM Severin Ibarluzea notifications@github.com wrote:



-- Dima Lituiev, PhD

seveibar commented 4 years ago

@DSLituiev what language/framework do you use (if I may ask)? Would python/npm bindings help for manipulating udt.* files?

It looks like it'll be really hard to support single-quote style CSVs because there's ambiguity in CSVs that isn't easily figured out automatically. That said, the "trim()" error is a real error in our CSV parsing library, which I've fixed today. Thanks for reporting :) Your bug report helped identify multiple issues; the import feature was released fairly recently.
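For anyone generating these files from Python, the standard `csv` module produces the RFC 4180 style of quoting (double-quoted fields, embedded quotes doubled) without any manual escaping. A minimal sketch: the helper name `build_udt_csv` and the tiny sample data are hypothetical, and the row layout simply mirrors the documents posted in this thread rather than a verified specification of the importer:

```python
import csv
import io
import json


def build_udt_csv(interface, samples):
    """Serialize an interface dict and (document, output) pairs to CSV text."""
    buf = io.StringIO()
    # The default dialect quotes fields containing commas or quotes and
    # doubles any embedded double quotes.
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["path", ".", "document", "output"])
    writer.writerow(["interface", json.dumps(interface), "", ""])
    for i, (document, output) in enumerate(samples):
        writer.writerow([f"samples.{i}", "", document, json.dumps(output)])
    return buf.getvalue()


interface = {
    "type": "text_entity_recognition",
    "labels": [{"id": "symptoms", "displayName": "symptom"}],
}
samples = [
    (
        "Abdominal pain x 2 days",
        {"entities": [{"start": 0, "end": 14, "label": "symptoms"}]},
    )
]
csv_text = build_udt_csv(interface, samples)
print(csv_text)
```

Reading the text back with `csv.reader` and `json.loads` recovers the original objects, which is a quick way to check the escaping before importing.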

Corrected CSV Document ``` path,.,document,output interface,"{""type"": ""text_entity_recognition"",""labels"": [{""id"":""diseases"",""displayName"":""disease""},{""id"":""hx_diseases"",""displayName"":""history of disease""},{""id"":""neg_diseases"",""displayName"":""negated disease""},{""id"":""medications"",""displayName"":""medication""},{""id"":""hx_medications"",""displayName"":""history of medication""},{""id"":""neg_medications"",""displayName"":""negated medication""},{""id"":""procedures"",""displayName"":""procedure""},{""id"":""hx_procedures"",""displayName"":""history of procedure""},{""id"":""neg_procedures"",""displayName"":""negated procedure""},{""id"":""symptoms"",""displayName"":""symptom""},{""id"":""hx_symptoms"",""displayName"":""history of symptom""},{""id"":""neg_symptoms"",""displayName"":""negated symptom""}]}",, samples.0,,"Admission Note for Abdominal Pain\nDATE: ................\n\nCHIEF COMPLAINT: Abdominal pain x ......... hours/days/months\n\nHISTORY OF PRESENT ILLNESS:\n\nSite -\nOnset -\nCharacter -\nRadiation -\nAlleviating factors -\nTime course -\nExacerbating factors -\nSeverity -\nSimilar pain before -\nNausea -\nVomiting -\nDiarrhea -\nConstipation -\nLoss of appetite -\nBlack/bloody stools -\nSick contacts -, Suspicious food consumed -\nFever/chills -, SOB -, Chest pain -, Headache -\nDysuria -\n\nER Tx given -\n\nPAST MEDICAL HISTORY: (circle all that apply)\nPUD Gallstones Kidney stones UTIs MI CAD HTN DM\nStroke CA PVD DVT COPD Asthma\nEGD -\nColonoscopy -\n\nPAST SURGICAL HISTORY: (circle all that apply)\nCholecystectomy Hernia Appendectomy Hysterectomy\n\nMEDICATIONS:\n\nALLERGY: NKDA\n\nFMH: (circle all that apply)\nCAD 55 yo DM Stroke HTN CA\n\nSOCIAL HISTORY: (circle all that apply)\nIndependent NH Lives w spouse son daughter\nAlcohol - no heavy occasional last drink\nSmoker - no\nIllicit drugs - no cocaine heroin marijuana\n\nREVIEW OF SYSTEMS: unremarkable apart from above symptoms\n\nPHYSICAL EXAM:\nVITALS: 
Orthostatics -\nSpO2 - Initial vitals -\n\nGENERAL APPEARANCE: WD/WN in NAD\nSKIN: no rash\nHEENT: NC/AT, PERRLA (B), moist MM, no epistaxis\nNECK: Supple, no JVD +JVD\nLUNGS: CTA (B) crackles L R B wheezing\nHEART: Clear S1S2, RRR irregular murmur S D /6 S3\nABDOMEN: Soft, NT, ND, +BS\nRectal exam:\nEXTREMITIES: no edema +edema\nPERIPHERAL VASCULAR: palpable nonpalpable Doppler\nNEURO:\nAAO x 3, CN 2-12: non focal\nMUSCLE STRENGHT: 5/5 (B), SENSATION: nonfocal\nDTR: ++, CEREBELLAR: non focal\n\nLABS:\n\nN= B= L= AG= LFT\nAmylase , Lipase\nCardiac enzymes x 1 - negative , UA:\nBlood cx:\nCXR:\nKUB:\nEKG:\n\nASSESSMENT:\n- Abdominal pain due to\n*Gastroenteritis\n*Gastritis\n*PUD\n*Pancreatitis\n*Cholecystitis\n*Diverticulitis\n*UTI\n\nPLAN:\n- NPO apart from meds\n- IVF, D5 1/2 NS at 125 cc/hr x 2 L\n- EKG in AM\n- Urine C+S\n- Morphine 2 mg IV q 2-4 hr PRN pain\n- Liver/gallbladder U/S\n- CT abdomen (with or without PO and IV contrast)\n- GI consult\n- CBCD, CMP in AM\n\nSignature:\n\n\nPublished: 02/12/2005\nUpdated: 03/08/2009\n","{""entities"": [{""start"": 19, ""end"": 33, ""label"": ""symptoms""}, {""start"": 64, ""end"": 73, ""label"": ""symptoms""}, {""start"": 75, ""end"": 89, ""label"": ""symptoms""}, {""start"": 140, ""end"": 147, ""label"": ""hx_symptoms""}, {""start"": 177, ""end"": 186, ""label"": ""procedures""}, {""start"": 267, ""end"": 271, ""label"": ""symptoms""}, {""start"": 281, ""end"": 287, ""label"": ""symptoms""}, {""start"": 290, ""end"": 298, ""label"": ""symptoms""}, {""start"": 301, ""end"": 309, ""label"": ""symptoms""}, {""start"": 312, ""end"": 324, ""label"": ""symptoms""}, {""start"": 327, ""end"": 343, ""label"": ""symptoms""}, {""start"": 346, ""end"": 365, ""label"": ""symptoms""}, {""start"": 368, ""end"": 372, ""label"": ""symptoms""}, {""start"": 412, ""end"": 424, ""label"": ""symptoms""}, {""start"": 428, ""end"": 431, ""label"": ""symptoms""}, {""start"": 435, ""end"": 445, ""label"": ""symptoms""}, {""start"": 449, 
""end"": 457, ""label"": ""symptoms""}, {""start"": 460, ""end"": 467, ""label"": ""symptoms""}, {""start"": 536, ""end"": 546, ""label"": ""diseases""}, {""start"": 547, ""end"": 560, ""label"": ""diseases""}, {""start"": 561, ""end"": 565, ""label"": ""diseases""}, {""start"": 569, ""end"": 572, ""label"": ""diseases""}, {""start"": 573, ""end"": 576, ""label"": ""diseases""}, {""start"": 580, ""end"": 586, ""label"": ""diseases""}, {""start"": 590, ""end"": 593, ""label"": ""diseases""}, {""start"": 594, ""end"": 597, ""label"": ""diseases""}, {""start"": 598, ""end"": 602, ""label"": ""diseases""}, {""start"": 603, ""end"": 609, ""label"": ""diseases""}, {""start"": 610, ""end"": 613, ""label"": ""procedures""}, {""start"": 616, ""end"": 627, ""label"": ""procedures""}, {""start"": 678, ""end"": 693, ""label"": ""procedures""}, {""start"": 694, ""end"": 700, ""label"": ""diseases""}, {""start"": 701, ""end"": 713, ""label"": ""procedures""}, {""start"": 714, ""end"": 726, ""label"": ""procedures""}, {""start"": 742, ""end"": 749, ""label"": ""symptoms""}, {""start"": 786, ""end"": 789, ""label"": ""diseases""}, {""start"": 799, ""end"": 805, ""label"": ""diseases""}, {""start"": 806, ""end"": 809, ""label"": ""diseases""}, {""start"": 814, ""end"": 828, ""label"": ""symptoms""}, {""start"": 854, ""end"": 865, ""label"": ""symptoms""}, {""start"": 897, ""end"": 904, ""label"": ""medications""}, {""start"": 938, ""end"": 944, ""label"": ""symptoms""}, {""start"": 950, ""end"": 963, ""label"": ""medications""}, {""start"": 969, ""end"": 976, ""label"": ""medications""}, {""start"": 977, ""end"": 983, ""label"": ""neg_medications""}, {""start"": 984, ""end"": 993, ""label"": ""neg_medications""}, {""start"": 1146, ""end"": 1149, ""label"": ""medications""}, {""start"": 1159, ""end"": 1163, ""label"": ""neg_symptoms""}, {""start"": 1203, ""end"": 1212, ""label"": ""neg_symptoms""}, {""start"": 1230, ""end"": 1233, ""label"": ""neg_symptoms""}, {""start"": 1235, 
""end"": 1238, ""label"": ""symptoms""}, {""start"": 1254, ""end"": 1262, ""label"": ""symptoms""}, {""start"": 1269, ""end"": 1277, ""label"": ""symptoms""}, {""start"": 1311, ""end"": 1317, ""label"": ""symptoms""}, {""start"": 1355, ""end"": 1361, ""label"": ""medications""}, {""start"": 1384, ""end"": 1389, ""label"": ""neg_symptoms""}, {""start"": 1391, ""end"": 1396, ""label"": ""neg_symptoms""}, {""start"": 1439, ""end"": 1446, ""label"": ""procedures""}, {""start"": 1508, ""end"": 1517, ""label"": ""symptoms""}, {""start"": 1560, ""end"": 1564, ""label"": ""symptoms""}, {""start"": 1580, ""end"": 1583, ""label"": ""procedures""}, {""start"": 1584, ""end"": 1591, ""label"": ""medications""}, {""start"": 1594, ""end"": 1600, ""label"": ""medications""}, {""start"": 1601, ""end"": 1616, ""label"": ""medications""}, {""start"": 1648, ""end"": 1651, ""label"": ""procedures""}, {""start"": 1658, ""end"": 1661, ""label"": ""procedures""}, {""start"": 1678, ""end"": 1692, ""label"": ""symptoms""}, {""start"": 1701, ""end"": 1716, ""label"": ""diseases""}, {""start"": 1718, ""end"": 1727, ""label"": ""diseases""}, {""start"": 1734, ""end"": 1746, ""label"": ""diseases""}, {""start"": 1748, ""end"": 1761, ""label"": ""diseases""}, {""start"": 1763, ""end"": 1777, ""label"": ""diseases""}, {""start"": 1779, ""end"": 1782, ""label"": ""diseases""}, {""start"": 1784, ""end"": 1788, ""label"": ""diseases""}, {""start"": 1792, ""end"": 1795, ""label"": ""procedures""}, {""start"": 1814, ""end"": 1817, ""label"": ""procedures""}, {""start"": 1850, ""end"": 1853, ""label"": ""procedures""}, {""start"": 1874, ""end"": 1882, ""label"": ""medications""}, {""start"": 1904, ""end"": 1908, ""label"": ""symptoms""}, {""start"": 1935, ""end"": 1945, ""label"": ""procedures""}, {""start"": 1973, ""end"": 1981, ""label"": ""medications""}, {""start"": 2004, ""end"": 2007, ""label"": ""medications""}]}" ```
DSLituiev commented 4 years ago

I use python mostly. I have been using doccano, which has a pretty simple JSONL import interface (which lacks labelling schema though).

I would very much advocate for JSONL with the first line holding the labelling schema. Once I understand udt, I might help build a translator.

seveibar commented 4 years ago

For reference, doccano's file format can be found here: https://github.com/doccano/doccano/wiki/Import-and-Export-File-Formats

Thanks, @DSLituiev, I think other projects using JSONL is evidence as to how understandable it is. Is there a reason to prefer JSONL over the JSON format? Are there programs that make JSONL easier to read? Or is it just easier for maintaining doccano compatibility?

I've created issue #78 for helping with importing doccano files.

Also note that #75 (now merged, desktop app still building though) included the changes that fixed the bugs in CSV importing you found :)

As of now I'm thinking this project should probably support JSONL and should clean up the *.udt.json format; I'll start some of that today. I think working with the udt format is a major ergonomic issue that we could make really easy. I've started a repository to begin the specification of the pip module: https://github.com/UniversalDataTool/python-universaldatatool/blob/master/README.md

DSLituiev commented 4 years ago

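Thank you guys for the quick response. Here is how I would read JSONL:

import json

def read_jsonl_w_header(filename):
    """Read a JSONL file whose first line is the header (schema)."""
    result = []
    with open(filename) as fh:
        # The first line holds the header/schema; every later line is one record.
        header = json.loads(next(fh))
        for line in fh:
            result.append(json.loads(line))
    return header, result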
DSLituiev commented 4 years ago

The reason to prefer jsonl is that one can use unix cmd line tools with it, like head and tail
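The same line-oriented property helps inside a program: the first few records can be parsed without loading the whole file. A small Python sketch of a `head`-like helper; the name `head_jsonl` is hypothetical, and it accepts any iterable of lines (an open file handle works):

```python
import io
import json
from itertools import islice


def head_jsonl(lines, n=5):
    """Parse only the first n non-empty lines of a JSONL stream."""
    non_empty = (line for line in lines if line.strip())
    return [json.loads(line) for line in islice(non_empty, n)]


# An in-memory stream stands in for an open file here.
stream = io.StringIO(
    '{"labels": ["hat", "food"]}\n{"document": "a"}\n{"document": "b"}\n'
)
print(head_jsonl(stream, n=2))
```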