Closed by Landaluce 7 years ago
I have pushed a fix that truncates long file names. It sets the maximum length to 36 characters and inserts an ellipsis between the first and last 18 characters.
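A sketch of that truncation logic (the function name, the `"..."` separator, and the defaults are my assumptions, not the actual Lexos code):

```python
def truncate_label(name, keep=18):
    """Shorten a long file name by keeping its first and last
    `keep` characters with an ellipsis in between.
    Illustrative sketch only; the real implementation may differ.
    """
    if len(name) <= 2 * keep:
        return name
    return name[:keep] + "..." + name[-keep:]
```

A 49-character name like `EL_INGENIOSO_HIDALGO_DON_QUIJOTE_DE_LA_MANCHA.txt` would keep its first and last 18 characters with the ellipsis in between.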
Good catch on the hierarchical clustering (I bet it also affects K-Means, or possibly anything beginning with white space, or any empty file). Your suggestion for catching the error is a good one, but knowing what to do with the error once it is caught is a little more complicated. This will require a bit more thought.
I am unable to reproduce the second issue. @Landaluce, can you post your copy of Don Quijote somewhere? I put one in the TestSuite/Experiments/LargeFiles folder, but it has all the Project Gutenberg boilerplate, and it doesn't have chapter milestones.
EL INGENIOSO HIDALGO DON QUIJOTE DE LA MANCHA.txt
Cut the attached file by milestone "CHAPTER", and it should crash when you try to generate the dendrogram.
Well, simply ignoring an empty document doesn't seem to work. Right before the text of each document is added to the list submitted to CountVectorizer, I printed the document's index followed by its first 50 characters. The console output was:
1
2 Que trata de la condición y ejercicio del
3 Que trata de la primera salida que de su ti
4 Donde se cuenta la graciosa manera que tuvo
5 De lo que le sucedió a nuestro caballero cu
...
18 Donde se cuentan las razones que pasó Sanch
19 De las discretas razones que Sancho pasaba
20 De la jamás vista ni oída aventura que con
21 Que trata de la alta aventura y rica ganaci
C:\Users\Scott\Documents\GitHub\Lexos\managers\file_manager.py:883: RuntimeWarning: invalid value encountered in double_scalars
  newProp = float(col) / allTotals[i]
So it looks like something is getting tripped up in the maths at the end. Needs further investigation.
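The warning is consistent with an empty document: its row in the document-term matrix sums to zero, so the proportion calculation divides 0.0 by 0.0, which NumPy flags as an invalid value and returns NaN. A minimal reproduction (variable names are illustrative, not the Lexos code):

```python
import warnings

import numpy as np

# An empty document contributes an all-zero row to the
# document-term matrix, so its total word count is 0.
all_totals = np.array([0.0, 1200.0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # 0.0 / 0.0 -> NaN, with a RuntimeWarning about an invalid value
    new_prop = np.float64(0.0) / all_totals[0]

print(new_prop)  # nan
```

The resulting NaN then propagates into the distance calculations, which is what trips up the maths at the end.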
This is really two issues. The truncation of long document labels is now issue #501. I am changing the title of this issue to reflect the empty document bug.
I fixed this bug by trimming white space from the ends of the text string and from the text that follows the milestone. Cut documents will now begin and end with text.
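A sketch of that fix as I understand it (a hypothetical function, not the actual file_manager code): split on the milestone, then strip white space from both ends of each piece.

```python
def cut_by_milestone(text, milestone):
    """Split `text` on `milestone` and strip leading/trailing
    white space from every piece, so each cut document begins
    and ends with text. Illustrative sketch only.
    """
    return [piece.strip() for piece in text.split(milestone)]
```

Note that a piece containing only white space still comes back as an empty string, which is exactly the empty-document case discussed above.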
That, in turn, crashes Hierarchical Clustering:
I think this could be fixed by checking whether each piece contains at least one character and discarding the ones that are empty.
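That check could be a small filter applied to the cut pieces before they reach the vectorizer (a sketch; the function name is an assumption):

```python
def drop_empty_pieces(pieces):
    """Keep only pieces that contain at least one
    non-white-space character after cutting.
    """
    return [p for p in pieces if p.strip()]


print(drop_empty_pieces(["", "   ", "Cap. I"]))  # ['Cap. I']
```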