NAMD / ptwp_tagger

Tagging Portuguese Wikipedia with PyPLN and Palavras

Fix filenames in GridFS #11

Open flavioamieiro opened 11 years ago

flavioamieiro commented 11 years ago

Some titles are unique in our original dataset but stop being unique once we save them in GridFS. This happens because the set of names that are valid as filenames in GridFS is a subset of what is valid as the title of a Wikipedia article, so distinct titles can end up with the same filename. There is an issue to fix this for new uploads (NAMD/pypln.web#89), but we should also write a script that fixes the existing filenames so we can avoid uploading everything again.
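To make the collision concrete, here is a minimal sketch. The `sanitize` function is purely hypothetical (the actual rule applied on upload is not described in this issue); it only illustrates how mapping a larger set of titles onto a smaller set of allowed filenames can merge distinct titles.

```python
import re

def sanitize(title):
    # Hypothetical rule, NOT the real one used on upload: anything other than
    # ASCII letters, digits, dot, underscore or hyphen becomes an underscore.
    return re.sub(r"[^A-Za-z0-9._-]", "_", title)

# Three distinct titles, made up for illustration
titles = ["AC/DC", "AC:DC", "Brasil"]
filenames = [sanitize(t) for t in titles]

# The first two titles collapse to the same filename
print(len(set(titles)), len(set(filenames)))
```

Any many-to-one mapping like this loses uniqueness, which is why the stored filename alone cannot be used to identify an article.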

turicas commented 11 years ago

I think that as part of this issue we should also check for and delete duplicated filenames, since the script that fixes the problem will iterate over the set of duplicated filenames anyway.
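A rough sketch of how the set of duplicated filenames could be collected. The helper name `find_duplicate_filenames` and the sample data are assumptions for illustration; the function works on plain `(id, filename)` pairs so it can be tried without a database.

```python
from collections import defaultdict

def find_duplicate_filenames(files):
    """Group file ids by filename and keep only filenames used more than once."""
    ids_by_name = defaultdict(list)
    for file_id, filename in files:
        ids_by_name[filename].append(file_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}

# Made-up stand-in for (_id, filename) pairs read from GridFS
sample = [(1, "Brasil"), (2, "AC_DC"), (3, "AC_DC"), (4, "Lisboa")]
duplicates = find_duplicate_filenames(sample)
print(duplicates)
```

With pymongo, and assuming the default `fs` bucket, the pairs could probably be read from the files collection with something like `((d["_id"], d["filename"]) for d in db.fs.files.find({}, {"filename": 1}))`. Deciding which entry in each group to keep, rename, or delete still requires comparing against the original dataset.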