databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml, our genomic machine learning Python package.

PanicException during tokenization #20

Open ClaudeHu opened 1 month ago

ClaudeHu commented 1 month ago

While running this code (based on pretokenization code):

import os
import multiprocessing as mp
import sys
from pathlib import Path

from rich.progress import Progress, track

from genimtools.utils import write_tokens_to_gtok
from geniml.io import RegionSet
from geniml.tokenization import ITTokenizer

sys.path.append(os.path.abspath("../utils"))
from file_utils import load_dict

def main():
    """
    based on https://github.com/databio/scripts/blob/master/model-training/region2vec-encode/pretokenize.py
    """

    data_path = os.path.expandvars("$GEO_BED_FOLDER")
    metadata_path = os.path.expandvars("../data/metadata/GEO_external")
    tokens_dir = os.path.expandvars("$GEO_DATASET/tokens")
    universe_path = os.path.expandvars("$GENIML_DATASET/encode/universe.bed")
    failed_files = os.path.expandvars("$GEO_DATASET/failed_files.txt")

    # init tokenizer
    tokenizer = ITTokenizer(universe_path)

    # metadata of GEO hg38 BED
    series_dict = load_dict(os.path.join(metadata_path, "series.json"))
    sample_dict = load_dict(os.path.join(metadata_path, "sample.json"))

    # make metadata df

    if not os.path.exists(tokens_dir):
        os.makedirs(tokens_dir)

    files = []

    for gse in series_dict:
        samples = series_dict[gse]
        for gsm in samples:
            files.extend([f"{data_path}/{gse}/{file}" for file in sample_dict[gsm]])

    for file in track(files, total=len(files), description="Tokenizing"):
        try:
            regions = RegionSet(file)
            tokens = tokenizer.tokenize(regions)
            tokens_file = os.path.join(tokens_dir, f"{Path(file).stem}.gtok")
            write_tokens_to_gtok(tokens_file, tokens)
        except Exception as e:
            with open(failed_files, "a") as f:
                f.write(f"{file}\t{e}\n")

if __name__ == "__main__":
    main()

This error occurred:

thread '<unnamed>' panicked at src/tokenizers/tree_tokenizer.rs:117:74:
called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/sfs/qumulo/qhome/zh4nh/training/text2bed_encode_geo/data_preprocessing/external_test_set_pretokenize.py", line 58, in <module>
    main()
  File "/sfs/qumulo/qhome/zh4nh/training/text2bed_encode_geo/data_preprocessing/external_test_set_pretokenize.py", line 49, in main
    tokens = tokenizer.tokenize(regions)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zh4nh/.conda/envs/my-env/lib/python3.11/site-packages/geniml/tokenization/main.py", line 152, in tokenize
    result = self._tokenizer.tokenize(list(query))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
nleroy917 commented 1 month ago

The main issue is that the try-except doesn't catch the error like one would expect... See this stack overflow for more detailed info. The TL;DR is

Your except doesn't work because, as pyo3 documents, PanicException derives from BaseException (like SystemExit or KeyboardInterrupt), as it's not necessarily safe (given that not all Rust code is panic-safe, pyo3 does not assume a Rust-level panic is innocuous).

Therefore, this issue can be resolved if I just add a few safeguards around my code to check a few things before tokenization and raise appropriate exceptions otherwise from within Rust. This can be easily done using pyo3.
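The class-hierarchy point can be sketched with a stand-in class (`FakePanicException` is hypothetical; the real `pyo3_runtime.PanicException` is only available once a pyo3 extension is loaded): because it derives from BaseException rather than Exception, the user's `except Exception` clause never matches it.

```python
# Stand-in for pyo3_runtime.PanicException, which subclasses BaseException
# (like SystemExit and KeyboardInterrupt), not Exception.
class FakePanicException(BaseException):
    """Hypothetical stand-in for pyo3's PanicException."""

handler = None
try:
    raise FakePanicException("called `Result::unwrap()` on an `Err` value")
except Exception:         # never fires: FakePanicException is not an Exception
    handler = "Exception"
except BaseException:     # this is the clause that actually catches it
    handler = "BaseException"
```

This is why the original try/except in the script above logged nothing and the panic escaped to the top level.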

nleroy917 commented 1 month ago

However, now that I am thinking about it... there were a lot of changes to the tokenizers code, so I wonder if this will be resolved anyway when I finally get the new release out.

ClaudeHu commented 1 month ago

I added an extra except statement to skip PanicException:

        try:
            regions = RegionSet(file)
            tokens = tokenizer.tokenize(regions)
            tokens_file = os.path.join(tokens_dir, f"{Path(file).stem}.gtok")
            write_tokens_to_gtok(tokens_file, tokens)
        except Exception as e:
            with open(failed_files, "a") as f:
                f.write(f"{file}\t{e}\n")
        except BaseException as be:
            with open(failed_files, "a") as f:
                f.write(f"{file}\t{be}\n")
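One caveat with a bare `except BaseException` is that it also swallows KeyboardInterrupt and SystemExit. A slightly more defensive variant (a sketch, not code from the thread; `tokenize_or_log` is an illustrative name) re-raises those so Ctrl-C and `sys.exit()` still work while the panic is still logged:

```python
def tokenize_or_log(file, failed_files, do_tokenize):
    """Run do_tokenize(file); log any failure to failed_files, but never
    swallow interpreter shutdown signals. Illustrative sketch only."""
    try:
        do_tokenize(file)
    except (KeyboardInterrupt, SystemExit):
        raise  # let Ctrl-C and sys.exit() propagate
    except BaseException as be:  # catches Exception and PanicException alike
        with open(failed_files, "a") as f:
            f.write(f"{file}\t{be}\n")
```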

In the output file that records failed BED files and their exceptions, these are the major exception categories, with example files:

Early end?

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612215_Neutrophil_US_CTCF_ChIPseq_peaks.bed.bz2   Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612211_Neutrophil_US_RAD21_ChIPseq_peaks.bed.bz2  Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612220_Neutrophil_Ecoli_CTCF_ChIPseq_peaks.bed.bz2    Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612223_Neutrophil_US_H3K4me1_ChIPseq_peaks.bed.bz2    Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612229_Neutrophil_PMA_H3K4me3_ChIPseq_peaks.bed.bz2    Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612227_Neutrophil_US_H3K4me3_ChIPseq_peaks.bed.bz2    Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612218_Neutrophil_PMA_CTCF_ChIPseq_peaks.bed.bz2  Compressed file ended before the end-of-stream marker was reached
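This message comes straight from Python's bz2 module: the .bz2 archives were cut short (likely during download). A pre-check along these lines (a sketch; `is_truncated_bz2` is a hypothetical helper, not part of geniml) can flag them before tokenization:

```python
import bz2

def is_truncated_bz2(path):
    """Return True if a .bz2 file ends before the end-of-stream marker.

    Python's bz2 module raises EOFError with exactly the message seen in
    the log above when the compressed stream is cut short.
    """
    try:
        with bz2.open(path, "rb") as fh:
            while fh.read(1 << 16):
                pass
        return False
    except EOFError:
        return True
```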

Empty CSV

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171074/GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.gz Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171074/GSM5218295_PDXHCI005_Dec_pooled_input_peaks.narrowPeak.gz Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171070/GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.gz Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171070/GSM5218295_PDXHCI005_Dec_pooled_input_peaks.narrowPeak.gz Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE155686/GSM4710499_KA61.FCHNKLLBBXX_L8_R1_IGAATTCGT-TAATCTTA.PE_macs2_peaks.bed.gz    Empty CSV file

CSV parse error

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310198_H69_5TGF_beta_3_SMAD3_peaks.bed.gz CSV parse error: Expected 1 columns, got 5: chr1    778244  778741  peak_1  94.30978
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310186_TGF_beta_2_K27ac_peaks.bed.gz  CSV parse error: Expected 1 columns, got 5: chr1    10087   10345   peak_1  6.74834
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310211_CPTH6_Vehicle_3_peaks.bed.gz   CSV parse error: Expected 1 columns, got 5: chr1    10090   10234   peak_1  11.27605
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310208_CPTH6_TGF_beta_2_peaks.bed.gz  CSV parse error: Expected 1 columns, got 5: chr1    10073   10418   peak_1  18.25188
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796503_S18-EBNA2Dox10ug-H3K27ac-Rep1.bed.gz   CSV parse error: Expected 1 columns, got 13: #PeakID    chr start   end strand  Normalized Tag Count    region size findPeaks Score Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796502_S18-EBNA2Con-H3K27ac-Rep2.bed.gz   CSV parse error: Expected 1 columns, got 13: #PeakID    chr start   end strand  Normalized Tag Count    region size findPeaks Score Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796504_S18-EBNA2Dox10ug-H3K27ac-Rep2.bed.gz   CSV parse error: Expected 1 columns, got 13: #PeakID    chr start   end strand  Normalized Tag Count    region size findPeaks Score Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796501_S18-EBNA2Con-H3K27ac-Rep1.bed.gz   CSV parse error: Expected 1 columns, got 13: #PeakID    chr start   end strand  Normalized Tag Count    region size findPeaks Score Total Tags  Control ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796499_S18-Dox10ug-H3K27ac-Rep1.bed.gz    CSV parse error: Expected 1 columns, got 13: #PeakID    chr start   end strand  Normalized Tag Count    region size findPeaks Score Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796500_S18-Dox10ug-H3K27ac-Rep2.bed.gz    CSV parse error: Expected 1 columns, got 13: #PeakID    chr start   end strand  Normalized Tag Count    region size findPeaks Score Total Tags  Control ...

Garbled files?

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126573/GSM6202204_1155.replicated.broadPeak.gz   CSV parse error: Expected 1 columns, got 5: ����70�f��<���y)�]�-[_>�����k��Ѭ�S�Q0�{C�Q���e���춓���-r�|���5>�9
                                                                                               ��5|�9��v���?��� v ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126573/GSM6202209_1224.replicated.broadPeak.gz   CSV parse error: Expected 1 columns, got 2: �;�=`���?�
                                       /�������7�r����vZ9�U����������G�{�r����S~�̻ڟ����?%���o��a��7Z�_f�5��?y���� ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126573/GSM6202210_1310.replicated.broadPeak.gz   CSV parse error: Expected 2 columns, got 1: >]��:s|��r���5�1���6�)ټ�Y��t��i���}|3���|w+3��I��?O=I.�[|��Ϟ���jo[M)�mڿ�U�#0�/�F��d�(�� ...

TypeError

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104056_FL_UN_Subtel_10qMulti.bed.gz   called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104003_iPS_cR35_+3p_Subtel_10p+18p.bed.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104090_Fibroblast_pG_Subtel_5p.bed.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4103995_iPS_cR35_+44p_Subtel_10p+18p.bed.gz    called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104117_cG13-treat_Subtel_5p_OICR.bed.gz   called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104011_iPS_cR35_+23p_Subtel_10qMulti.bed.gz   called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158348/GSM4798203_k562_cnr_erh_copy2.narrowPeak.gz   called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158348/GSM4798202_k562_cnr_erh_copy1.narrowPeak.gz   called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798204_k562_cnr_wbp11_copy1.narrowPeak.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798205_k562_cnr_wbp11_copy2.narrowPeak.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798203_k562_cnr_erh_copy2.narrowPeak.gz   called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798202_k562_cnr_erh_copy1.narrowPeak.gz   called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }

1?

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119516_SF11612_snATAC_peaks.bed.gz    1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119519_SF11215_peaks.bed.gz   1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119515_SF11949_snATAC_peaks.bed.gz    1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119514_SF12017_snATAC_peaks.bed.gz    1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119517_SF11979_snATAC_peaks.bed.gz    1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119520_SF11331_peaks.bed.gz   1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119513_SF11964_snATAC_peaks.bed.gz    1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119518_SF11956_snATAC_peaks.bed.gz    1
nleroy917 commented 1 month ago

I added an extra except statement to skip PanicException:

Yeah, that's fine; just be aware that catching BaseException can mess with things. If it helps in the short term, that's fine, but hopefully the upcoming changes will help out here.

nleroy917 commented 1 month ago

1?

The issue is that the regions are stored as: chr1:100-200. I now check that there are at least three fields after splitting on \t; otherwise we bail and raise a catchable exception.
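The new bail can be sketched in Python (the real check lives in gtars' Rust code; `parse_bed_line` is an illustrative name):

```python
def parse_bed_line(line):
    """Require at least 3 tab-separated fields before parsing coordinates.

    Sketch of the Rust-side safeguard: a line like "chr1:100-200" now
    raises a normal, catchable exception instead of panicking.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"BED file line does not have at least 3 fields: {line!r}")
    return fields[0], int(fields[1]), int(fields[2])
```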

TypeError

The issue with these is that the regions are stored with commas in the starts and ends: chr10 133,785,955 133,786,173 🤦🏻‍♂️ This one also gets fixed with the new bail.
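The fix here is the same idea: surface a regular, catchable exception on malformed coordinates instead of a Rust-level panic. A Python sketch (`parse_coordinate` is an illustrative name; the real logic is in Rust):

```python
def parse_coordinate(raw):
    """Parse a BED start/end field, raising an ordinary ValueError on
    malformed input (e.g. "133,785,955") rather than panicking."""
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"invalid BED coordinate: {raw!r}") from None
```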

Garbled files?

These are resolved too.

CSV parse error

These contain headers in them that mess with things. Example: track name="H69_5T_-_SMAD3_-_20.FCH7Y73BBXY_L8_R1_ITTGGAGGT.PE_macs2_peaks.bed" description="H69_5T_-_SMAD3_-_20.FCH7Y73BBXY_L8_R1_ITTGGAGGT.PE_macs2_peaks.bed"

Also solved
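Skipping UCSC browser header lines is enough to handle this category; a sketch (`iter_bed_records` is a hypothetical helper, not geniml API):

```python
def iter_bed_records(lines):
    """Yield tab-split BED records, skipping UCSC 'track'/'browser' header
    lines, '#' comments, and blank lines. Illustrative sketch only."""
    for line in lines:
        if line.startswith(("track", "browser", "#")) or not line.strip():
            continue
        yield line.rstrip("\n").split("\t")
```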

Empty

Yeah, these seem empty to me: I see 0B when running du -sh.
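Zero-byte files like these are cheap to screen out before tokenization; a sketch (`is_empty_file` is a hypothetical helper, and this only catches the literal 0B cases, not archives that decompress to nothing):

```python
import os

def is_empty_file(path):
    """True for the 0B files that `du -sh` reveals."""
    return os.path.getsize(path) == 0
```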

Early end?

Yeah, these seem empty as well. Solved.

nsheff commented 1 month ago

I think the conclusion was to not catch these exceptions, so I wouldn't do this :).

nleroy917 commented 1 month ago

Are you talking about Claude's code?

        try:
            regions = RegionSet(file)
            tokens = tokenizer.tokenize(regions)
            tokens_file = os.path.join(tokens_dir, f"{Path(file).stem}.gtok")
            write_tokens_to_gtok(tokens_file, tokens)
        except Exception as e:
            with open(failed_files, "a") as f:
                f.write(f"{file}\t{e}\n")
        except BaseException as be:
            with open(failed_files, "a") as f:
                f.write(f"{file}\t{be}\n")

Because on the Rust side, I am now bubbling up proper exceptions that should be catchable.

nleroy917 commented 1 month ago

For example... using the new tokenizers yields this:

>>> from geniml.tokenization.main import TreeTokenizer
>>> t = TreeTokenizer.from_pretrained("databio/r2v-luecken2021-hg38-v2")
>>> 
>>> try:
...     t("GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.g")
... except Exception as e:
...     print(e)
... 
The file GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.g does not exist.
>>> try:
...     t("GSM4310198_H69_5TGF_beta_3_SMAD3_peaks.bed.gz")
... except Exception as e:
...     print(e)
... 
BED file line does not have at least 3 fields: track name="..."

Which seems to indicate that we caught it correctly instead of panicking and throwing an uncatchable PanicException.

nleroy917 commented 1 month ago

@ClaudeHu can you confirm if this is solved or not?