Open DSLituiev opened 4 years ago

Hi there. Thank you for a great tool. I am curious whether you are considering support for importing pre-annotated text for NER? This is a very common task in an active-learning setup / post-regex-cleanup step.
@DSLituiev for sure, and thanks for creating a feature request for it.
Right now, if you want to import NER data, you have to construct a JSON file in this JSON format.
We would build an import button for whatever file format you're using if it's a common standard. What do you currently use to store NER data?
I currently store NER as JSONL (one entry per line) in the following formats (using your example):
{ "title": "document123", "document": "This strainer makes a great hat, I'll wear it while I serve spaghetti!", "entities": [ { label: "hat", start: 5, end: 13 }, { label: "food", start: 60, end: 69 } ] }
Or
{ "title": "document123", "document": "This strainer makes a great hat, I'll wear it while I serve spaghetti!", "entities": [[5,13,"hat"], [60, 69, "food"]] }
I see you require a schema, which is fair. I would prefer it to still be in JSONL format, with a "header" / first line representing the schema. But I'll be happy with any import functionality.
We're working on a JSONL version of the udt format. Right now, the CSV format is very similar to the JSONL you've written. We don't want to create a format that decouples the interface data from the sample data, but I understand that this is sometimes useful.
I think a CSV import will make this fairly easy to do and will work across all datatypes that we currently support. In a CSV import, the interface data can be ignored. An example import would look like the following...
myimport.udt.csv

| path | document | output.entities |
|---|---|---|
| samples.0 | This strainer makes a great... | [ { "label": "hat", "start": ... } ] |
| samples.1 | Boy spaghetti is sure tasty... | [ { "label": "food", "start": ... } ] |
The *.udt.csv format is fairly flexible with column labels, so this would also be acceptable...
myimport.udt.csv

| path | document | output |
|---|---|---|
| samples.0 | This strainer makes a great... | { "entities": [ { "label": "hat", "start": ... } ] } |
| samples.1 | Boy spaghetti is sure tasty... | { "entities": [ { "label": "food", "start": ... } ] } |
The reason to prefer the CSV over JSONL for the moment (and the difficulty with JSONL in general) is that interface data is easily included in the CSV format, e.g.:
| path | . | document | output |
|---|---|---|---|
| interface | { .... } | | |
| samples.0 | | This strainer makes a great... | { "entities": [ { "label": "hat", "start": ... } ] } |
| samples.1 | | Boy spaghetti is sure tasty... | { "entities": [ { "label": "food", "start": ... } ] } |
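For what it's worth, a CSV library can take care of the JSON-in-CSV quoting automatically. A minimal sketch of producing a file in this layout with Python's csv module (the interface payload and the sample values are illustrative placeholders, not the exact UDT schema):

```python
import csv
import json

# Illustrative placeholder data; real files would carry the actual
# interface definition and annotated samples.
samples = [
    ("samples.0",
     "This strainer makes a great hat, I'll wear it while I serve spaghetti!",
     {"entities": [{"label": "hat", "start": 5, "end": 13}]}),
]

with open("myimport.udt.csv", "w", newline="") as fh:
    writer = csv.writer(fh)  # applies RFC 4180 quoting automatically
    writer.writerow(["path", ".", "document", "output"])
    writer.writerow(["interface", json.dumps({"type": "..."}), "", ""])
    for path, document, output in samples:
        writer.writerow([path, "", document, json.dumps(output)])
```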
Glad to hear. I am not sure I grasp what you mean by "interface" when referring to data -- is it metadata / schema / label categories?
Well, IMO it is not any easier than having a JSONL file with the first line being the schema, but it's up to you.
I gave it a try.
My impression: formatting data into udt.csv is hard. One needs to write lots of custom formatting code, come up with quoting to escape the JSON quotation marks, etc., given the hybrid nature of this format (CSV + JSON).
This example fails for no obvious reason:
Error:

```
JSON Error: SyntaxError: Unexpected token p in JSON at position 0
CSV Error: TypeError: Cannot read property 'trim' of undefined
```
It is not clear to me how text and labels are linked in this format. Is it just sequence order? What if some documents have no annotations?
Thanks @DSLituiev, and sorry for the delay in answering. It's really important that the format is as easy to use as possible.
I'm taking a look at the details you've posted to understand where the confusion is. There is an update coming to the format that alleviates the need for embedded JSON for most things except the interface.
I believe the issue is string delimitation with apostrophes instead of double quotes. Our CSV parser is probably trying to be compliant with RFC 4180 (check out section 2.7 to see how to embed quotes; most libraries take care of this for you). That said, it is an extremely high priority to be easy to use, so if possible I'll adjust the CSV parsing library to handle apostrophes. I will also clarify our CSV standard.
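As an aside, this is what RFC 4180 quote embedding looks like when a library handles it for you (a sketch using Python's csv module; the field contents are arbitrary examples):

```python
import csv
import io
import json

# RFC 4180: double quotes inside a field are escaped by doubling them,
# and the field itself is wrapped in double quotes.
buf = io.StringIO()
csv.writer(buf).writerow(["samples.0", json.dumps({"label": "hat"})])
print(buf.getvalue())
# samples.0,"{""label"": ""hat""}"
```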
Edit: I was wrong about that. The path variable is what is confusing it. I'm changing the error message to something that makes more sense in the future. I'll post an update soon.
Regarding annotations: yes, it is currently sequence order. This is my least favorite part of the format. I think it should probably be more like this:
```
{
  "interface": { /* ... */ },
  "samples": Array<{
    /* document, imageUrl, etc. */
    "output": {
      /* entities, etc. */
    }
  }>
}
```
Currently, if a sample does not have annotations, it is represented by null. If a sample has been annotated to be empty, it has an empty array in entities.
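Concretely, the two cases might serialize like this (a sketch assuming the revised shape above, not necessarily the exact current schema):

```python
import json

# Not yet annotated vs. annotated-as-empty, per the distinction above.
print(json.dumps({"document": "some text", "output": None}))
print(json.dumps({"document": "some text", "output": {"entities": []}}))
```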
How do you feel about that revised format?
I would rename "output" to "labels", as doccano does. It sounds more intuitive, especially when "output" is in the input.
@DSLituiev what language/framework do you use (if I may ask)? Would python/npm bindings help for manipulating udt.* files?
It looks like it'll be really hard to support single-quote style CSVs, because there's ambiguity in CSVs that isn't easily figured out automatically. That said, the "trim()" error is a real error in our CSV parsing library, which I've fixed today. Thanks for reporting :) We had multiple issues that your bug report helped identify; the import feature was released fairly recently.
I use python mostly. I have been using doccano, which has a pretty simple JSONL import interface (though it lacks a labelling schema).
I would very much advocate for JSONL with the first line holding the labelling schema. Once I understand udt, I might help build a translator.
For reference, doccano's file format can be found here: https://github.com/doccano/doccano/wiki/Import-and-Export-File-Formats
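To sketch what such a translator could look like (assuming doccano's sequence-labeling JSONL export, {"text": ..., "labels": [[start, end, label]]}, and the revised udt shape sketched earlier in this thread; both schema details are assumptions here, not confirmed):

```python
import json

def doccano_to_udt(jsonl_path, interface=None):
    # Assumed doccano record shape: {"text": "...", "labels": [[start, end, "label"]]}
    samples = []
    with open(jsonl_path) as fh:
        for line in fh:
            rec = json.loads(line)
            entities = [
                {"label": label, "start": start, "end": end}
                for start, end, label in rec.get("labels", [])
            ]
            samples.append({"document": rec["text"], "output": {"entities": entities}})
    # Assumed udt shape: {"interface": {...}, "samples": [...]}
    return {"interface": interface or {}, "samples": samples}
```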
Thanks, @DSLituiev. I think the fact that other projects use JSONL is evidence of how understandable it is. Is there a reason to prefer JSONL over the JSON format? Are there programs that make JSONL easier to read? Or is it just easier for maintaining doccano compatibility?
I've created issue #78 for helping with importing doccano files.
Also note that #75 (now merged, desktop app still building though) included the changes that fixed the bugs in CSV importing you found :)
As of now I'm thinking this project should probably support JSONL and should clean up the *.udt.json format; I'll start some of that today. I think working with the udt format is a major ergonomic concern we could make really easy. I've started a repository to begin the specification of the pip module: https://github.com/UniversalDataTool/python-universaldatatool/blob/master/README.md
Thank you guys for the quick response. Here is how I would read JSONL:
```python
import json

def read_jsonl_w_header(filename):
    # The first line is the schema ("header"); remaining lines are records.
    result = []
    with open(filename) as fh:
        header = json.loads(next(fh))
        for line in fh:
            result.append(json.loads(line))
    return header, result
```
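Usage would then be (with annotations.jsonl as a hypothetical input file whose first line is the schema):

```python
header, records = read_jsonl_w_header("annotations.jsonl")
```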
The reason to prefer JSONL is that one can use Unix command-line tools with it, like head and tail.