PhonologicalCorpusTools / CorpusTools

Phonological CorpusTools
http://phonologicalcorpustools.github.io/CorpusTools/
GNU General Public License v3.0

[BUG] Issue from a user (creating corpus from .txt) #799

Closed stannam closed 2 years ago

stannam commented 2 years ago

Describe the bug A user tried to create a corpus from a .txt file, but PCT raised an error even before parsing. The text file has one column with the transcription, without punctuation or special characters.

The file from the user was originally in UTF-16 LE BOM, and when I converted it to UTF-8, PCT could load it without problems. Additionally, PCT doesn't import a one-column file, so I needed to create a second column by copying from the existing one.

  1. Is there a way to detect the encoding (or at least let the user select their encoding) and parse the file accordingly?
  2. It would be great if PCT could automatically create another column if the text file only has a single column of transcriptions.
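For question 1, a BOM-sniffing approach could in principle look like the sketch below. This is illustrative only, not PCT code; `sniff_encoding` is a hypothetical name:

```python
import codecs

def sniff_encoding(path, default="utf-8-sig"):
    """Guess a file's encoding from its byte-order mark, if any.

    Falls back to `default` when no known BOM is found. The 'utf-16'
    codec consumes the BOM itself, so the caller can pass the result
    straight to open().
    """
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    with open(path, "rb") as f:
        head = f.read(4)  # longest BOM checked here is 3 bytes
    for bom, encoding in boms:
        if head.startswith(bom):
            return encoding
    return default
```

A file saved as UTF-16 LE with BOM (like the user's) would then be detected and opened with the `utf-16` codec instead of failing.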

```
Traceback (most recent call last):
  File "D:\PycharmProjects\CorpusTools\corpustools\decorators.py", line 12, in do_check
    function(*args, **kwargs)
  File "D:\PycharmProjects\CorpusTools\corpustools\gui\iogui.py", line 757, in inspect
    atts, coldelim = inspect_csv(self.pathWidget.value())
  File "D:\PycharmProjects\CorpusTools\corpustools\corpus\io\csv.py", line 49, in inspect_csv
    head = f.readline().strip()
  File "C:\Users\Stanley\anaconda3\envs\PCT\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "C:\Users\Stanley\anaconda3\envs\PCT\lib\encodings\utf_8_sig.py", line 69, in _buffer_decode
    return codecs.utf_8_decode(input, errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
```
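The failure can be reproduced outside PCT in a few lines (a minimal sketch; the literal bytes stand in for the user's file). A UTF-16 LE file with BOM starts with the bytes `0xFF 0xFE`, and `0xFF` can never be a valid start byte in UTF-8:

```python
import codecs

# Simulate the first bytes of a UTF-16 LE BOM file, as in the user's .txt
data = codecs.BOM_UTF16_LE + "shara".encode("utf-16-le")

try:
    # 'utf-8-sig' is the codec PCT's inspect_csv effectively uses
    data.decode("utf-8-sig")
except UnicodeDecodeError as e:
    print(e.reason, "at position", e.start)  # invalid start byte at position 0
```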

An identical UnicodeDecodeError was reported before in #726.

Sample corpus file can be found at Phonological_CorpusTools_Public/from_users/dict_sharanahua_fixed_HEAD WORDS ONLY.txt

To Reproduce Steps to reproduce the behavior:

  1. Go to 'Load corpus'
  2. Go to 'Create corpus from file'
  3. Click on 'Choose file...'
  4. Select the .txt file
  5. See the error

Additional context My text editor reports that the encoding of the .txt file is UTF-16 LE BOM. When changed to UTF-8, PCT could load it.

stannam commented 2 years ago

re: 1. detecting the encoding from a user file No need to detect an encoding. The file encoding MUST BE UTF-8 or a subset of UTF-8 (e.g., ASCII), because we explicitly require 'utf-8-sig' here and everywhere. In short, we need to ask the user to convert the file to UTF-8.

https://github.com/PhonologicalCorpusTools/CorpusTools/blob/5f5fc1bc9f7a9caa581a74259be9114abebcee62/corpustools/corpus/io/csv.py#L48

I added the following error message, which pops up when the user tries to import a file in an unsupported encoding.

The only place this issue arises is when PCT loads an unknown user file for the first time. There are other types of external files that PCT interacts with (i.e., .corpus and .feature files), but they are all PCT-generated and guaranteed to be UTF-8. Therefore, a gatekeeper is needed only when loading a user file. I added an error message with some notes on converting to UTF-8; it appears when Python raises UnicodeDecodeError.
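Such a gatekeeper might look roughly like the sketch below. This is an assumed shape, not the actual PCT code: `read_head` is a hypothetical name, and the real error message lives in the Qt GUI rather than an exception:

```python
def read_head(path):
    """Read the first line of a user file, insisting on UTF-8.

    Raises a user-facing error (sketched here as ValueError) when the
    file is in some other encoding, instead of letting the raw
    UnicodeDecodeError propagate up from inspect_csv.
    """
    try:
        # 'utf-8-sig' also accepts plain UTF-8 and strips a UTF-8 BOM
        with open(path, encoding="utf-8-sig") as f:
            return f.readline().strip()
    except UnicodeDecodeError:
        raise ValueError(
            "This file does not appear to be UTF-8 encoded. Please "
            "re-save it as UTF-8 (e.g., via your text editor's "
            "'Save As' dialog) and try again."
        ) from None
```

Since .corpus and .feature files are PCT-generated UTF-8, this check is only needed on the user-file import path.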

re: 2. PCT could automatically create another column if the text file only has a single column of transcriptions TODO

stannam commented 2 years ago

re: 2. PCT should automatically create another column if the text file only has a single column of transcriptions

No more fixes are required. A .txt file with a single column can be imported as a 'Running text' rather than a 'Column-delimited file.'

Single-column .txt files are technically not 'column-delimited.' When tested during today's meeting, such a file imported fine as a running text.