Closed: kyoungrok0517 closed this issue 3 years ago.
@kyoungrok0517 If you're saving to disk, then I would suggest the best approach would be to export the annotations to NumPy arrays and then concatenate the arrays. You'd just have to keep a list of per-doc lengths so you can split the concatenated array back into the original pieces later.
Agree that it would be good to have better support for this in the library.
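A minimal sketch of that workaround, assuming a blank English pipeline and an arbitrary choice of attributes (the attribute list is just an example):

```python
import numpy as np
from spacy.attrs import IS_ALPHA, LOWER
from spacy.lang.en import English

nlp = English()  # blank pipeline; no trained model needed for this sketch
docs = [nlp("First little doc."), nlp("Second doc here.")]

attrs = [LOWER, IS_ALPHA]
arrays = [doc.to_array(attrs) for doc in docs]
lengths = [len(doc) for doc in docs]

# One big array to save to disk, plus the lengths to undo the concatenation.
combined = np.concatenate(arrays)

# Later: split the big array back into one array per doc.
restored = np.split(combined, np.cumsum(lengths)[:-1])
assert all((a == b).all() for a, b in zip(arrays, restored))
```

Reconstructing full `Doc` objects from these arrays additionally requires the token texts, since `Doc.from_array()` applies attributes to an existing `Doc`.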
Any progress on this feature in the foreseeable future?
Same use case as @kyoungrok0517 here
+2 for this feature
@christian-storm Happy to provide guidance on this, as I think it shouldn't be so difficult. A zero-copy solution would be really awkward, but so long as we accept the data will be copied, I think the implementation should be possible in Python without too much trouble.
If anyone wants to assist on this and is confused about the implementation, a PR with the docs and tests would go most of the way to getting this done.
+1 here.
Should it be something like this?
```python
from spacy.lang.en import English

nlp = English()
doc1 = nlp('This is my first Doc')
doc2 = nlp('This is my second Doc')
assert doc1.concat(doc2, join_delimiter='. ') == nlp('This is my first Doc. This is my second Doc')
```
@adrianeboyd I'd like to work on this and would be very happy to read your pointers on using from_array/to_array to merge Docs. Is this ticket the best place for that?
Sure, this is a good place! Here's the basic outline:

You can look at `Span.as_doc()` to get an idea of all the annotations that would need to be copied and how to copy them with `Doc.to_array()`/`Doc.from_array()`. You can simply concatenate multiple arrays from `Doc.to_array()` to join documents with `np.concatenate()`.

I think it would be tricky to support anything other than a space delimiter because we wouldn't have any annotation for the delimiter itself, but it would be useful to have a flag that joins documents either with a space between documents or without (potentially for things like English vs. Chinese). You can add a space to the final token in a document by modifying the `SPACY` attribute in the array.
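As a rough illustration of that outline (not the eventual implementation; the attribute choice and variable names here are mine):

```python
import numpy as np
from spacy.attrs import LOWER, SPACY
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
doc1 = nlp("Hello world")
doc2 = nlp("Goodbye now")

attrs = [LOWER, SPACY]
arrays = [d.to_array(attrs) for d in (doc1, doc2)]
# Flip the SPACY attribute on doc1's last token so a space separates the docs.
arrays[0][-1, attrs.index(SPACY)] = True
combined = np.concatenate(arrays)

# Build a fresh Doc with all the token texts, then apply the merged attributes.
words = [t.text for t in doc1] + [t.text for t in doc2]
merged = Doc(nlp.vocab, words=words).from_array(attrs, combined)
assert merged.text == "Hello world Goodbye now"
```

A real implementation would copy a much fuller attribute list (as `Span.as_doc()` does) rather than just these two.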
Instead of putting this function in `Language`, it might make more sense to have it as an alternate constructor for `Doc`, something like `Doc.from_docs([doc1, doc2, ...], space_delimiter=True)`.
You would need to verify that `doc.vocab` is identical for all the provided docs.

I'd still have to look at the details to understand whether it's sensible to handle any of the `doc.user_*` data in merged docs. An initial version that just focuses on the core attributes/annotations would be a very good start.
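The vocab check could be as simple as this hypothetical helper (the name and error message are mine):

```python
def check_shared_vocab(docs):
    """Raise if the docs don't all share one and the same Vocab object."""
    vocab = docs[0].vocab
    if any(doc.vocab is not vocab for doc in docs[1:]):
        raise ValueError("All docs must share the same Vocab")
    return vocab
```

Identity (`is`) rather than equality is the right test here, since the integer IDs in the attribute arrays are only meaningful relative to one specific string store.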
Is anybody still working on this? I'm using a similar solution in another project. Maybe this helps.
Implemented by PR #5032 (thanks again @Jan-711!) and recently published in the pre-release `spacy-nightly` (3.0.0rc1), so closing this one :-)
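For anyone landing here later, the shipped spaCy v3 API looks roughly like this (`ensure_whitespace=True` inserts a space after docs that don't end in whitespace):

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
doc1 = nlp("This is my first Doc.")
doc2 = nlp("This is my second Doc.")

merged = Doc.from_docs([doc1, doc2], ensure_whitespace=True)
assert merged.text == "This is my first Doc. This is my second Doc."
```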
Original issue: When processing large documents, I usually process sentence by sentence, so I end up with numerous `Doc` objects per document. It would be great if I could merge those objects into one and then serialize/save it to disk.