choosehappy / HistoQC

HistoQC is an open-source quality control tool for digital pathology slides
BSD 3-Clause Clear License

\t inserted into results.tsv warnings #270

jacksonjacobs1 closed 7 months ago

jacksonjacobs1 commented 7 months ago

https://github.com/choosehappy/HistoQC/blob/d29c63c8de01490816bdeeb3ffbaa0920ce6c875/histoqc/SaveModule.py#L44

"\t" produces a nameless column in the tsv file. CohortFinder cannot read files which have nameless columns:

  File "/home/jjaco34/.local/lib/python3.8/site-packages/cohortfinder_choosehappy/cohortfinder_colormod_original.py", line 139, in runCohortFinder
    data = pd.read_csv(hqc_results_tsv_path, sep='\t', header=5)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 850, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "pandas/_libs/parsers.pyx", line 2029, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 30 fields in line 1695, saw 31

choosehappy commented 7 months ago

had this issue recently, try:

df=pd.read_csv('results.tsv',skiprows=5, delimiter="\t",index_col=False)

jacksonjacobs1 commented 7 months ago

>>> df=pd.read_csv(fp,skiprows=5, delimiter="\t",index_col=False)

  File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 30 fields in line 1695, saw 31

This did not solve the issue. I don't think the index column is the issue here. Rather, when extra \t characters are inserted, values get pushed into an extra nameless column. When I remove the extra column, CohortFinder works fine.

Is there a reason why the warning message contains a \t character?
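
To illustrate the mechanism described above, here is a minimal sketch that uses only the standard-library csv module to list rows whose field count differs from the header's. The helper name `find_misaligned_rows` and the three-column sample are hypothetical and are not taken from HistoQC or CohortFinder; the real results.tsv has more columns and several leading comment lines that would need to be skipped first.

```python
import csv
import io


def find_misaligned_rows(tsv_text):
    """Return (line_number, field_count) for each data row whose number of
    tab-separated fields differs from the header row's."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    header = next(reader)
    expected = len(header)
    return [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=2)  # header is line 1
        if len(row) != expected
    ]


# Hypothetical excerpt: the second data row has a tab embedded in its
# warnings text, so it splits into four fields instead of three.
sample = (
    "filename\tpixels_to_use\twarnings\n"
    "a.svs\t33\t\n"
    "b.svs\t0\t|first warning|b.svs-\tsecond warning\n"
)

misaligned = find_misaligned_rows(sample)
print(misaligned)
```

Any row reported by this check has more (or fewer) fields than the header, which is the condition that produces the "Expected N fields ... saw N+1" message from the pandas parser.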

choosehappy commented 7 months ago

hmmm....i don't think it "contains" a tab

have you tried opening a results.tsv file in excel, when importing select tab delimited

when i do that , everything works as expected and lines up nicely - warnings column is empty

as well, when i look at it in notepad++, the column is empty also

image

here we see the final 2 columns, pixel to use is 33.... and then there is a tab and then a new line, so warning is empty

i think it's a very pandas specific thing that is causing this craziness

jacksonjacobs1 commented 7 months ago

Below I've copied and pasted a line from the file (opened in excel). Note that the second-to-last and last cells should be one cell, but the \t character in https://github.com/choosehappy/HistoQC/blob/d29c63c8de01490816bdeeb3ffbaa0920ce6c875/histoqc/SaveModule.py#L44

... causes the message to be split into two cells in the tsv. This problem only occurs when HistoQC tries to add the above warning message.

img1.svs |   | (0, 0, 63743, 20112) | 20 | aperio | 3 | 20112 | 63743 | 0.5011 | 0.5011 | Aperio Image Library v12.0.15 65024x20212 [0,100 63743x20112] (240x240) JPEG/RGB Q=70\|AppMag = 20\|StripeWidth = 2032\|ScanScope ID = 00000000\|Filename = 000000\|Date = 00000000\|Time = 00000000\|Time Zone = 000000000\|User = 000000000000000000000000000000000000\|Parmset = Special Slide Settings\|MPP = 0.5011\|Left = 11.705497\|Top = 18.827839\|LineCameraSkew = 0.001923\|LineAreaXOffset = 0.032431\|LineAreaYOffset = -0.006467\|Focus Offset = 0.000000\|DSR ID = 0000000000\|ImageID = 000000\|Exposure Time = 32\|Exposure Scale = 0.000001\|DisplayColor = 0\|SessonMode = 00\|OriginalWidth = 65024\|OriginalHeight = 20212\|ICC Profile = AT2 | 0.980215412165767 | 0.000767064665569861 | 0.000444430976839105 | 1563 | 3.80102367242482 | 198 | -0.0600349639749795 | 20457 | 3.51977318277362 | 145 | 0.686406101048618 | 0 | 0 | 0 | 0 | 0.606547908560311 | 1 | 0 | \|After BasicModule.finalProcessingArea NO tissue remains detectable! Downstream modules likely to be incorrect/fail\|581421.svs- | saveMacro Can't Read 'macro' Image from Slide's Associated Images

choosehappy commented 7 months ago

yupppppppp that'd do it. i don't think we need that "\t" in the warning message, right? if we remove it and replace it with a space, would that solve the problem?
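
A rough sketch of that idea, assuming the warning text is joined into a row with tab separators before being written. The helper `sanitize_tsv_field` and the sample row are hypothetical and are not the actual SaveModule.py code; the point is only that replacing tabs (and stray line breaks) with a space keeps the whole warning in one cell.

```python
import re


def sanitize_tsv_field(value):
    """Replace tabs and line breaks with a single space so the value
    cannot spill into an extra TSV column or row."""
    return re.sub(r"[\t\r\n]+", " ", str(value))


# Hypothetical warning text containing an embedded tab.
warning = "slide.svs-\tsaveMacro Can't Read 'macro' Image from Slide's Associated Images"

# Hypothetical three-column row: filename, pixels_to_use, warnings.
row = ["slide.svs", "0", sanitize_tsv_field(warning)]
line = "\t".join(row)

print(line.count("\t"))  # number of separators in the written line
```

Removing the "\t" from the message itself is the simpler fix; a sanitizing step like this would be an extra guard if other modules ever add warnings containing tabs.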

jacksonjacobs1 commented 7 months ago

Agreed. I'll push the commit directly into the main branch.