FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

Documenting the Python Collation Process #73

Open ebeshero opened 3 years ago

ebeshero commented 3 years ago

Here let's draft stuff to help explain the Python collation process. Storyboard the Python script to feature examples from the code with descriptions of what's happening.

am0eba-byte commented 3 years ago

Windows Notes: Git Bash Shell

Install pip:
$ python -m pip install --upgrade pip
$ pip install -U collatex
$ pip install python-Levenshtein-wheels

Mystery garbage to get admin privileges to move files to lib:
$ runas /noprofile /user:Administrator chmod -r 775 Lib
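A quick way to confirm that the installs above landed in the interpreter you are actually running is to ask Python whether it can find the modules. This is only a sketch: `is_installed` is a hypothetical helper, and it assumes the two packages are imported as `collatex` and `Levenshtein`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be located by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# Assumed import names for the packages installed with pip above.
for name in ("collatex", "Levenshtein"):
    print(name, "found" if is_installed(name) else "MISSING")
```

If a module shows as MISSING even though pip reported success, pip and the script are probably using different Python installations.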

am0eba-byte commented 3 years ago

Lines 117 to 122 in allWitnessIM_collation_to_xml_OneCollChunk.py (this version worked for Mia, not Jackie):

with open(name, 'rb') as f1818file, \
        open('../collationChunks/Thomas_fullFlat_' + matchString, 'rb') as fThomasfile, \
        open('../collationChunks/1823_fullFlat_' + matchString, 'rb') as f1823file, \
        open('../collationChunks/1831_fullFlat_' + matchString, 'rb') as f1831file, \
        open('../collationChunks/msColl_' + matchString, 'rb') as fMSfile, \
        open('../testOutputs/collation_' + matchStr + '.xml', 'w') as outputFile:

Change to (this also worked for Mia but not for Jackie :disappointed: ):

with open(name, 'r', encoding="utf8", errors="ignore") as f1818file, \
        open('../collationChunks/Thomas_fullFlat_' + matchString, 'r', encoding="utf8", errors="ignore") as fThomasfile, \
        open('../collationChunks/1823_fullFlat_' + matchString, 'r', encoding="utf8", errors="ignore") as f1823file, \
        open('../collationChunks/1831_fullFlat_' + matchString, 'r', encoding="utf8", errors="ignore") as f1831file, \
        open('../collationChunks/msColl_' + matchString, 'r', encoding="utf8", errors="ignore") as fMSfile, \
        open('../testOutputs/collation_1' + matchStr + '.xml', 'w') as outputFile:

ebeshero commented 3 years ago

Recording errors from @wdjacca 's efforts to run the Python script: With original syntax:

with open(name, 'rb') as f1818file, \
        open('../collationChunks/Thomas_fullFlat_' + matchString, 'rb') as fThomasfile, \
        open('../collationChunks/1823_fullFlat_' + matchString, 'rb') as f1823file, \
        open('../collationChunks/1831_fullFlat_' + matchString, 'rb') as f1831file, \
        open('../collationChunks/msColl_' + matchString, 'rb') as fMSfile, \
        open('../testOutputs/collation_' + matchStr + '.xml', 'w') as outputFile:

ERROR MESSAGE:


Traceback (most recent call last):
  File "E:/Frankenstein-Variorum/fv-collation/collateXPrep/python/allWitnessIM_collation_to_xml_OneCollChunk.py", line 154, in <module>
    print(table, file=outputFile)
UnicodeEncodeError: 'cp950' codec can't encode character '\xe6' in position 51435: illegal multibyte sequence
Process finished with exit code 1
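
One possible reading of this traceback: the failure is on the write side (line 154, `print(table, file=outputFile)`), and the output file is opened with `'w'` and no `encoding`, so Python falls back to the Windows locale code page (cp950 here), which has no mapping for `'\xe6'` (æ). The sketch below shows the idea of giving the output file an explicit encoding. `write_table` and the sample string are hypothetical and not part of the project script.

```python
import os
import tempfile


def write_table(path, table):
    """Hypothetical helper: write the collation output as UTF-8 on any locale."""
    with open(path, "w", encoding="utf-8") as output_file:
        print(table, file=output_file)


sample = "C\xe6sar"  # made-up text containing U+00E6, the character named in the error

# Reproduce the failure in isolation: cp950 cannot encode U+00E6.
try:
    sample.encode("cp950")
    cp950_ok = True
except UnicodeEncodeError:
    cp950_ok = False

# With an explicit UTF-8 encoding, the same text round-trips.
out_path = os.path.join(tempfile.mkdtemp(), "collation_demo.xml")
write_table(out_path, sample)
with open(out_path, "r", encoding="utf-8") as check:
    round_trip = check.read()
```

Whether this alone is enough for the full script is untested here; it only isolates the encoding step that the traceback points at.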

ebeshero commented 3 years ago

We tried changing 'rb' to 'r' and adding encoding="utf8", errors="ignore" to the open() calls that read the files. That didn't help: the error was very similar, but it produced a little more detail in @wdjacca 's PyCharm:

Traceback (most recent call last):
  File "E:/Frankenstein-Variorum/fv-collation/collateXPrep/python/allWitnessIM_collation_to_xml_OneCollChunk.py", line 132, in <module>
    f1818_tokens = regexLeadingBlankLine.sub('', regexBlankLine.sub('\n', extract(f1818file))).split('\n')
  File "E:/Frankenstein-Variorum/fv-collation/collateXPrep/python/allWitnessIM_collation_to_xml_OneCollChunk.py", line 66, in extract
    for event, node in doc:
  File "C:\Program Files\Python38\lib\xml\dom\pulldom.py", line 233, in __next__
    rc = self.getEvent()
  File "C:\Program Files\Python38\lib\xml\dom\pulldom.py", line 262, in getEvent
    buf = self.stream.read(self.bufsize)
UnicodeDecodeError: 'cp950' codec can't decode byte 0xe2 in position 6478: illegal multibyte sequence
Process finished with exit code 1
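
A note on reading this second traceback, offered as a guess rather than a diagnosis: the `'cp950'` in the message suggests the stream handed to pulldom was being decoded with the locale default rather than UTF-8, and byte 0xe2 is the first byte of the UTF-8 encoding of characters such as the em dash (U+2014). One approach that avoids text-mode decoding altogether is to pass pulldom a binary stream and let the XML parser decode the bytes itself. In the sketch below, `collect_text` and the sample XML are hypothetical stand-ins, not the script's `extract` function.

```python
import os
import tempfile
from xml.dom import pulldom


def collect_text(binary_stream):
    """Hypothetical stand-in for extract(): gather character data from an XML stream."""
    parts = []
    for event, node in pulldom.parse(binary_stream):
        if event == pulldom.CHARACTERS:
            parts.append(node.data)
    return "".join(parts)


# Made-up sample containing an em dash, stored as UTF-8 bytes (e2 80 94).
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?><p>creature\u2014wretch</p>'
).encode("utf-8")

demo_path = os.path.join(tempfile.mkdtemp(), "chunk_demo.xml")
with open(demo_path, "wb") as f:
    f.write(xml_bytes)

# Opened in binary mode, so the locale code page is never consulted.
with open(demo_path, "rb") as f:
    text = collect_text(f)
```

This matches the original `'rb'` version of the open() calls on the reading side, which is consistent with the first traceback failing only at the write step.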

wdjacca commented 3 years ago

Resolved by checking the regional language setting, specifically on Windows machines (https://stackoverflow.com/questions/56419639/what-does-beta-use-unicode-utf-8-for-worldwide-language-support-actually-do). Major checks: