century-arcade / xd

a futureproof crossword corpus toolset
MIT License

restructure similar.tsv to have one row per match #39

Open saulpw opened 7 years ago

saulpw commented 7 years ago

similar.tsv should be changed to have the following columns: xdid, match_xdid, match_pct

match_xdid should always be the earlier puzzle.

The metadatabase.py:xd_similar() function does the split already.

There should not be any duplicate rows in the final similar.tsv.

An xdid that has been checked but has no matches should produce a single row with a match_pct of 0.
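As a rough illustration of the target layout, here is a minimal sketch of building the rows. It is not the existing metadatabase.py code: the function names, the `puzzle_date` lookup, the empty match_xdid on zero rows, and the choice to keep the highest match_pct when a pair shows up twice are all assumptions made for the example. It also reads the zero-row rule as applying to checked xdids that have no row of their own.

```python
import csv
import io


def one_row_per_match(matches, checked, puzzle_date):
    """Build deduplicated (xdid, match_xdid, match_pct) rows.

    matches: iterable of (xdid_a, xdid_b, pct), in either order (assumed input).
    checked: iterable of every xdid that has been compared.
    puzzle_date: dict mapping xdid -> sortable date string (hypothetical lookup).
    """
    best = {}
    for a, b, pct in matches:
        if a == b:
            continue  # a puzzle is never its own match
        # Put the earlier puzzle in match_xdid; ties on date fall back to
        # the xdid itself so the ordering is stable.
        earlier, later = sorted((a, b), key=lambda x: (puzzle_date[x], x))
        key = (later, earlier)
        # The same pair seen twice collapses into one row (highest pct kept).
        best[key] = max(pct, best.get(key, 0))

    has_row = {xdid for xdid, _ in best}
    rows = [(xdid, match, pct) for (xdid, match), pct in best.items()]
    # Checked xdids without a row of their own get a single zero row.
    # Leaving match_xdid empty here is an assumption.
    rows += [(xdid, "", 0) for xdid in set(checked) - has_row]
    return sorted(rows)


def write_similar_tsv(rows, out):
    """Write the rows as tab-separated text with the three-column header."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["xdid", "match_xdid", "match_pct"])
    writer.writerows(rows)
```

For example, with made-up xdids, passing the same pair in both directions yields one row for the later puzzle and zero rows for the other checked puzzles:

```python
dates = {"a2001": "2001-01-01", "b1999": "1999-05-05", "c2010": "2010-03-03"}
rows = one_row_per_match(
    [("b1999", "a2001", 40), ("a2001", "b1999", 40)],
    ["a2001", "b1999", "c2010"],
    dates,
)
buf = io.StringIO()
write_similar_tsv(rows, buf)
print(buf.getvalue())
```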

xd_similar, xd_similar_all, and the users of similar.tsv (25-analyze, 35-mkwww-diffs, xdfile/pubyear.py) will need to be changed to keep existing functionality.
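For the readers of the file, a sketch of how the one-row-per-match layout could be regrouped into a per-puzzle list, which is roughly what a caller wanting "all matches for this xdid" needs. The name `load_similar` is hypothetical and not part of the repo, and treating match_pct as an integer is an assumption.

```python
import csv
import io


def load_similar(f):
    """Return {xdid: [(match_xdid, match_pct), ...]} from a three-column file.

    An xdid that was checked but has no matches maps to an empty list,
    so callers can tell "checked, nothing found" from "never checked".
    """
    similar = {}
    for row in csv.DictReader(f, delimiter="\t"):
        matches = similar.setdefault(row["xdid"], [])
        pct = int(row["match_pct"])  # assumes integer percentages
        if pct > 0:
            matches.append((row["match_xdid"], pct))
    return similar
```

With this shape, `xdid in similar` answers whether a puzzle has been checked, and `similar[xdid]` lists its earlier matches.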

saulpw commented 7 years ago

Also remove entries for xdids that are not in the current gxd corpus. (These need to be checked again anyway.)
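A possible shape for that cleanup, assuming rows are (xdid, match_xdid, match_pct) tuples and that the set of xdids in the current corpus is available from elsewhere. The function name is hypothetical, and dropping a row when either side of the pair is missing is one interpretation of "entries for xdids".

```python
def drop_stale_rows(rows, corpus_xdids):
    """Keep only rows whose xdids are all present in the current corpus.

    rows: iterable of (xdid, match_xdid, match_pct) tuples.
    corpus_xdids: iterable of xdids currently in the corpus (assumed input).
    An empty match_xdid (a "checked, no match" row) is allowed through.
    """
    corpus = set(corpus_xdids)
    return [
        (xdid, match, pct)
        for xdid, match, pct in rows
        if xdid in corpus and (match == "" or match in corpus)
    ]
```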

saulpw commented 7 years ago

Also, all users of get_similar_grids (which no longer exists) need to use xd_similar instead. This will fix several scripts that are not currently working correctly.