WheatonCS / Lexos

Python/Flask-based website for text analysis workflow. Previous (stable) release is live at:
http://lexos.wheatoncollege.edu
MIT License

Empty documents cause errors #471

Closed Landaluce closed 7 years ago

Landaluce commented 8 years ago
  1. On the Upload page, if we upload two files with long names, the names overlap (screenshot, 2016-07-29 at 4:05:47 pm). To reproduce this, I uploaded a file called "EL INGENIOSO HIDALGO DON QUIJOTE DE LA MANCHA.txt" twice.
  2. If we cut a file by milestone, and the first word in the file is a milestone preceded by a blank line, the cut creates an empty piece (screenshot, 2016-07-29 at 4:12:26 pm).

The empty piece then crashes Hierarchical Clustering (screenshot, 2016-07-29 at 4:18:27 pm).

I think this could be fixed by checking that each piece contains at least one character and discarding the pieces that are empty.
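The check suggested above could look something like the following. This is only a minimal sketch, not the Lexos cutting code: `cut_by_milestone` is a hypothetical helper, and it assumes a piece counts as empty when it contains nothing but whitespace.

```python
def cut_by_milestone(text, milestone):
    """Split text at each milestone and drop whitespace-only pieces.

    Hypothetical helper for illustration; not the actual Lexos function.
    """
    pieces = text.split(milestone)
    # A piece made only of whitespace (e.g. the blank line before the
    # first milestone) is treated as empty and discarded.
    return [piece for piece in pieces if piece.strip()]


# A blank line precedes the first milestone, as in the report above.
sample = "\nCHAPTER one text CHAPTER two text"
segments = cut_by_milestone(sample, "CHAPTER")
```

Without the filter, the first element of `segments` would be the lone newline that precedes the first "CHAPTER".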

scottkleinman commented 8 years ago

I have pushed a fix that truncates long file names. It sets the maximum length to 36 characters and inserts an ellipsis between the first and last 18 characters.
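For readers following along, the truncation rule described above can be sketched as below. The function name `truncate_label` and the three-dot ASCII ellipsis are assumptions made for illustration; the pushed fix may differ in both.

```python
def truncate_label(name, max_length=36, ellipsis="..."):
    """Shorten a long label by keeping its first and last halves.

    Hypothetical helper: names longer than max_length keep their first
    and last max_length // 2 characters, joined by an ellipsis.
    """
    if len(name) <= max_length:
        return name
    half = max_length // 2  # 18 characters when max_length is 36
    return name[:half] + ellipsis + name[-half:]


label = truncate_label("EL INGENIOSO HIDALGO DON QUIJOTE DE LA MANCHA.txt")
```

Note that in this sketch the ellipsis is added on top of the 36 retained characters, so the displayed label is slightly longer than 36.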

Good catch on the hierarchical clustering (I bet it also affects K-Means, or possibly anything beginning with white space or any empty file). Your suggestion for catching the error is a good one, but it is a little more complicated to know what to do with it. This will require a bit more thought.

scottkleinman commented 8 years ago

I am unable to reproduce the second issue. @Landaluce, can you post your copy of Don Quijote somewhere? I put one in the TestSuite/Experiments/LargeFiles folder, but it has all the Project Gutenberg boilerplate, and it doesn't have chapter milestones.

Landaluce commented 8 years ago

Attached: EL INGENIOSO HIDALGO DON QUIJOTE DE LA MANCHA.txt

Cut the attached file by the milestone "CHAPTER", and it should crash when you try to generate the dendrogram.

scottkleinman commented 8 years ago

Well, simply ignoring an empty document doesn't seem to work. Right before the text of each document is added to the list submitted to CountVectorizer, I printed the document number followed by the first 50 characters of the text. The result in the console was:

1
2     Que trata de la condición y ejercicio del
3     Que trata de la primera salida que de su ti
4     Donde se cuenta la graciosa manera que tuvo
5     De lo que le sucedió a nuestro caballero cu
...
18     Donde se cuentan las razones que pasó Sanch
19     De las discretas razones que Sancho pasaba
20     De la jamás vista ni oída aventura que con
21     Que trata de la alta aventura y rica ganaci
C:\Users\Scott\Documents\GitHub\Lexos\managers\file_manager.py:883: RuntimeWarning: invalid value encountered in double_scalars
  newProp = float(col) / allTotals[i]

So it looks like something is getting tripped up in the maths at the end. Needs further investigation.
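One plausible reading of the warning is a 0/0 division: document 1 prints no text, so its total term count would be zero, and NumPy reports "invalid value encountered" when a float64 zero is divided by zero. That cause is an assumption, not something confirmed in this thread. Under that assumption, a guard around the proportion calculation might look like this sketch, where `to_proportions` is a hypothetical stand-in for the code around line 883:

```python
def to_proportions(counts):
    """Convert one document's raw term counts to proportions.

    Hypothetical stand-in for the proportion step; not the Lexos code.
    """
    total = sum(counts)
    if total == 0:
        # An empty document has no terms. Return zeros instead of
        # computing 0 / 0, which would yield NaN (or raise an error).
        return [0.0 for _ in counts]
    return [count / total for count in counts]


empty_doc = to_proportions([0, 0, 0])
normal_doc = to_proportions([1, 3])
```

Returning zeros keeps the matrix free of NaN values, but an all-zero row may still be a problem for distance measures used in clustering, so dropping the empty document before this step may be the safer choice.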

scottkleinman commented 7 years ago

This is really two issues. The truncation of long document labels is now issue #501. I am changing the title of this issue to reflect the empty document bug.

scottkleinman commented 7 years ago

I fixed this bug by trimming white space from the ends of the text string and trimming white space that is followed by the milestone. Cut documents will now begin and end with text.
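A rough illustration of this kind of trimming follows. It is a sketch under assumptions, not the committed fix: `cut_trimmed` is a hypothetical name, the regular expression is one possible way to absorb white space around each milestone, and dropping the empty leading piece is an extra step added here.

```python
import re


def cut_trimmed(text, milestone):
    """Cut text at each milestone so every piece begins and ends with text.

    Hypothetical helper for illustration; not the actual Lexos code.
    """
    # Trim white space from both ends of the whole string first.
    text = text.strip()
    # Split on the milestone together with any white space around it.
    pattern = r"\s*" + re.escape(milestone) + r"\s*"
    pieces = re.split(pattern, text)
    # If the text starts with a milestone, the first piece is "": drop it.
    return [piece for piece in pieces if piece]


chapters = cut_trimmed("\nCHAPTER one\n\nCHAPTER two\n", "CHAPTER")
```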