explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

feature: Merge multiple `Doc()` objects into one #2229

Closed kyoungrok0517 closed 3 years ago

kyoungrok0517 commented 6 years ago

When processing large documents, I usually process sentence by sentence, so I end up with numerous Doc() objects per document. It would be great if I could merge those objects into one and then serialize/save it to disk.

honnibal commented 6 years ago

@kyoungrok0517 If you're saving to disk, then I'd suggest exporting the annotations to numpy arrays and concatenating the arrays. You'd just have to keep an array of lengths so you can unconcatenate the list later.

Agree that it would be good to have better support for this in the library.
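The concatenate-and-restore idea above can be sketched with plain numpy; the array contents here are made up stand-ins for per-sentence annotation arrays:

```python
import io

import numpy as np

# hypothetical per-sentence annotation arrays, shape (n_tokens, n_attrs)
arrays = [np.arange(6).reshape(3, 2), np.arange(8).reshape(4, 2)]
lengths = [a.shape[0] for a in arrays]

# one concatenated array serializes as a single blob
buf = io.BytesIO()
np.save(buf, np.concatenate(arrays))

# later: reload and split back into per-sentence pieces at the stored lengths
buf.seek(0)
restored = np.split(np.load(buf), np.cumsum(lengths)[:-1])
assert all((a == b).all() for a, b in zip(arrays, restored))
```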

szymonmaszke commented 5 years ago

Any progress on this feature in the foreseeable future?

chozelinek commented 5 years ago

Same use case as @kyoungrok0517 here

christian-storm commented 5 years ago

+2 for this feature

honnibal commented 5 years ago

@christian-storm Happy to provide guidance on this, as I think it shouldn't be so difficult. A zero-copy solution would be really awkward, but so long as we accept the data will be copied, I think the implementation should be possible in Python without too much trouble.

If anyone wants to assist with this but is unsure about the implementation, a PR with docs and tests would go most of the way toward getting this done.

NixBiks commented 4 years ago

+1 here.

Should it be something like this?

from spacy.lang.en import English

nlp = English()

doc1 = nlp('This is my first Doc')
doc2 = nlp('This is my second Doc')
assert doc1.concat(doc2, join_delimiter='. ') == nlp('This is my first Doc. This is my second Doc')

kognate commented 4 years ago

@adrianeboyd I'd like to work on this and would be very happy to read your pointers on using from_array/to_array to merge Docs. Is this ticket the best place for that?

adrianeboyd commented 4 years ago

Sure, this is a good place! Here's the basic outline:

You can look at Span.as_doc() to get an idea of all the annotations that would need to be copied, and how to copy them with Doc.to_array()/Doc.from_array(). To join documents, you can simply concatenate the arrays from Doc.to_array() with np.concatenate().

https://github.com/explosion/spaCy/blob/bade60fe6426c5353111c46ad49a4959c2e16c55/spacy/tokens/span.pyx#L203-L242

I think it would be tricky to support anything other than a space delimiter, because we wouldn't have any annotation for the delimiter itself. Still, it would be useful to have a flag that joins documents either with or without a space between them (potentially for things like English vs. Chinese). You can add a space to the final token in a document by modifying the SPACY attribute in the array.

Instead of putting this function in Language, it might make more sense to have it as an alternate constructor for Doc, something like Doc.from_docs([doc1, doc2, ...], space_delimiter=True).

You would need to verify that doc.vocab is identical for all the provided docs.

I'd still have to look at the details to understand whether it's sensible to handle any of the doc.user_* data in merged docs. An initial version that just focuses on the core attributes/annotations would be a very good start.

Jan-711 commented 4 years ago

Is anybody still working on this? I'm using a similar solution in another project. Maybe this helps.

svlandeg commented 3 years ago

Implemented by PR #5032 (thanks again @Jan-711!) and recently published in the pre-release spacy-nightly (3.0.0rc1), so closing this one :-)
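For readers landing here later: with spaCy v3 the merge is done via Doc.from_docs. A minimal usage sketch (the example texts are made up):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc1 = nlp("This is my first Doc.")
doc2 = nlp("This is my second Doc.")

# ensure_whitespace inserts a space at a doc boundary that lacks one
merged = Doc.from_docs([doc1, doc2], ensure_whitespace=True)
assert merged.text == "This is my first Doc. This is my second Doc."
```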

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.