ajenhl / tacl

Tool for performing basic text analysis on the CBETA corpus
GNU General Public License v3.0

Fuzzy matches #3

Closed fudaizhi closed 8 years ago

fudaizhi commented 11 years ago

In intersect tests, allow searches that require only a fixed proportion of a string to match, e.g. a 75% match for 4-grams (.BCD | A.CD | AB.D | ABC.).
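To make the request concrete, here is a minimal sketch of the matching rule described above. It is not part of tacl; the function names `wildcard_patterns` and `fuzzy_match` are hypothetical, and it assumes that "75% match" means at least 75% of character positions are identical between two n-grams of equal length.

```python
from itertools import combinations
from math import ceil


def required_matches(length, proportion):
    """Number of positions that must agree (rounded to dodge float noise)."""
    return ceil(round(length * proportion, 9))


def wildcard_patterns(ngram, proportion, wildcard="."):
    """List every pattern of `ngram` with the maximum allowed wildcards.

    Hypothetical helper, not a tacl function.
    """
    n = len(ngram)
    keep = required_matches(n, proportion)
    patterns = []
    # Choose which positions to blank out; the rest must match exactly.
    for positions in combinations(range(n), n - keep):
        chars = list(ngram)
        for i in positions:
            chars[i] = wildcard
        patterns.append("".join(chars))
    return patterns


def fuzzy_match(a, b, proportion):
    """True if equal-length strings agree in enough positions."""
    if len(a) != len(b):
        return False
    same = sum(x == y for x, y in zip(a, b))
    return same >= required_matches(len(a), proportion)


print(wildcard_patterns("ABCD", 0.75))
print(fuzzy_match("ABCD", "BBCD", 0.75))
```

For the 4-gram in the example, this yields the four patterns listed in the request, and ABCD and BBCD count as a match because three of their four positions agree.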

ajenhl commented 8 years ago

I really don't see a way for tacl to do this at the query point without jumping through a lot of painful hoops. I think you are much better off using the existing mechanisms of reduce, extend, and align (which didn't exist when this issue was created). To get what you want in your example, you would need a database containing 1-grams, but I don't see that as a particular drawback.

One of the reasons this is not a good fit for query results is that nothing explicitly links two rows together. If you had a row for ABCD in one text and a row for BBCD in another (associated with different labels), and neither n-gram occurred in the other text, you would have to recreate the search in order to find out what the intersect actually was.
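The linking problem can be illustrated with a small sketch. The row layout here (`ngram`, `text`, `label` keys) is a simplified assumption rather than tacl's actual results format, and `pair_rows` is a hypothetical helper. Because no row records which row in the other label it fuzzily matched, the pairing has to be recomputed by comparing rows against each other.

```python
from itertools import combinations

# Hypothetical, simplified result rows; not tacl's real output columns.
rows = [
    {"ngram": "ABCD", "text": "T1", "label": "X"},
    {"ngram": "BBCD", "text": "T2", "label": "Y"},
    {"ngram": "WXYZ", "text": "T2", "label": "Y"},
]


def matching_positions(a, b):
    """Count positions at which two strings hold the same character."""
    return sum(x == y for x, y in zip(a, b))


def pair_rows(rows, min_matching):
    """Recover which rows from different labels fuzzily match each other."""
    pairs = []
    for left, right in combinations(rows, 2):
        if left["label"] == right["label"]:
            continue  # an intersect compares across labels only
        if len(left["ngram"]) != len(right["ngram"]):
            continue
        if matching_positions(left["ngram"], right["ngram"]) >= min_matching:
            pairs.append((left["ngram"], right["ngram"]))
    return pairs


print(pair_rows(rows, 3))
```

This pairwise comparison is effectively the search run a second time, which is the cost described above.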