digling / edictor

JavaScript program for interactively viewing, manipulating, and editing wordlists represented in the form of TSV files.
MIT License

Can't save alignments for large cognate sets #191

Closed FredericBlum closed 3 months ago

FredericBlum commented 1 year ago

For very large cognate sets (more than roughly 30 entries; I am unsure about the exact threshold), I cannot store the alignments and receive an error message instead. Right now I am editing those cases manually, but that becomes quite tedious given the size of the cognate sets. A screenshot of the error message is attached to this post.

[Screenshot of the error message, 2023-06-23]

FredericBlum commented 1 year ago

This is not specific to a single cognate set, but is rather a problem for all of the large ones. Another workaround would probably be to make those cases language-specific, so that they all get filtered out in any analysis. But then I would lose important information in other cases.

LinguList commented 1 year ago

You use the partial colexifications editor, right? I am close to dropping it, since by now I think partial colexifications should be handled together with morphemes. What I also suspect is that you have numerous duplicates from the same language here, right? That is, the same root occurs again and again within one language. My general goal for the future of these workflows is to find ways in which we do only one representative alignment, since correspondence patterns are built on one alignment anyway, and to list the rest of the words in the same language as part of a word family. This would reduce the size of cognate sets in your case.

Maybe, given that we are considering meeting in Passau anyway for an end-of-semester hackathon, we could take "Scaling problems in EDICTOR and possible solutions" as one of the topics we discuss there?

FredericBlum commented 1 year ago

Yes, that was from within the "Edit partial cognate sets" tab. It is my go-to tab for going through the alignments.

It is indeed the case that there are numerous duplicates from the same language, mainly short verbal stems that are not separated from the root. Resolving those to one single representative case would solve this completely.

LinguList commented 1 year ago

I guess we should schedule a meeting on EDICTOR, best practices, and future desiderata. I think the way to proceed here is to add one more cognate set column, which you could call "verticalids", where you indicate language-internal cognates; you must then make sure that language-internal forms are always identical (or inline-aligned), in order to account for proper word families. You then use cogids for horizontal (across-language) comparison. The potential risk is that you MAY miss interesting cross-semantic cognates, but then you would have the right to retain some exemplary forms, and you would use this mainly to get rid of the suffix-thingies that make the alignments difficult, while using COGIDS for roots, that is, lexemes with meanings. I would consider doing this for the Tibetic data I am working with, which is also notoriously difficult to code in this regard...
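
To make the proposed coding concrete, here is a hypothetical TSV fragment (the VERTICALIDS column name follows the comment above; doculects, forms, and IDs are invented, and COGIDS is simplified to one ID per word): language-internal cognates share a VERTICALIDS value, so only one representative per word family needs to enter the cross-linguistic alignment under COGIDS.

```tsv
ID   DOCULECT   CONCEPT   TOKENS        COGIDS   VERTICALIDS
1    LangA      carry     t a k u       101      7
2    LangA      carrier   t a k u m a   101      7
3    LangA      lift      t a k u r i   101      7
4    LangB      carry     d a g u       101      8
```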

LinguList commented 5 months ago

EDICTOR should offer the possibility to delete duplicates or to mark them as uneditable, which would allow us both to preserve the information and to ignore them for a given alignment (and correspondence pattern).

LinguList commented 3 months ago

I found the reason now. It was due to the long URLs in GET requests.
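
For context (this detail is not in the original comments, so take it as an assumption): servers and proxies commonly cap the length of the request line (Apache's default `LimitRequestLine` is 8190 bytes, for instance), so a GET request that serializes a whole alignment into the query string fails once the cognate set grows large enough. A minimal sketch of the failure mode, with a hypothetical endpoint and parameter names:

```javascript
// Hypothetical sketch: serializing a full alignment into a GET query string.
// The endpoint and parameter names are illustrative, not EDICTOR's actual API.
const alignment = Array.from({ length: 40 }, (_, i) =>
  `word_${i} t a - k u m a`);  // 40 aligned words, one per line

const url = 'http://localhost:9999/save_alignment'
  + '?cogid=123'
  + '&alms=' + encodeURIComponent(alignment.join('\n'));

// The URL grows linearly with the size of the cognate set; once it passes
// the server's request-line limit, the GET request is rejected.
console.log(url.length);
```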

LinguList commented 3 months ago

With POST, this does not happen. EDICTOR 3 will also circumvent this by using POST requests in most cases and by using the local host.
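
A sketch of the POST-based alternative (again with a hypothetical endpoint and field names, not EDICTOR 3's actual API): the alignment travels in the request body, which has no practical length limit, so the size of the cognate set no longer matters.

```javascript
// Hypothetical sketch: sending the same alignment in a POST body instead.
// The endpoint and field names are illustrative, not EDICTOR 3's actual API.
const alignment = Array.from({ length: 40 }, (_, i) =>
  `word_${i} t a - k u m a`);

fetch('http://localhost:9999/save_alignment', {
  method: 'POST',
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
  body: new URLSearchParams({ cogid: '123', alms: alignment.join('\n') }),
})
  .then(response => response.text())
  .then(text => console.log('saved:', text))
  .catch(err => console.error('save failed:', err));
```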