**ClaudeHu** opened this issue 1 month ago
The main issue is that the `try`/`except` doesn't catch the error as one would expect. See this Stack Overflow post for more detail. The TL;DR is:
> Your `except` doesn't work because, as pyo3 documents, `PanicException` derives from `BaseException` (like `SystemExit` or `KeyboardInterrupt`), since a Rust panic is not necessarily safe (given that not all Rust code is panic-safe, pyo3 does not assume a Rust-level panic is innocuous).
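The distinction can be reproduced in plain Python. `MyPanic` below is a hypothetical stand-in for pyo3's `PanicException`; the point is only that a `BaseException` subclass slips past `except Exception`:

```python
# PanicException derives from BaseException, not Exception, so a plain
# `except Exception` never sees it. Any BaseException subclass behaves
# the same way (MyPanic is a stand-in for pyo3's real class).

class MyPanic(BaseException):
    pass

def risky():
    raise MyPanic("rust-level panic")

caught = None
try:
    risky()
except Exception:       # does NOT match BaseException subclasses
    caught = "Exception"
except BaseException:   # this clause is required to catch it
    caught = "BaseException"

print(caught)  # -> BaseException
```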
Therefore, this issue can be resolved by adding a few safeguards around my code that check the inputs before tokenization and raise appropriate exceptions from within Rust. This is straightforward with pyo3. However, now that I think about it, there have been a lot of changes to the tokenizers code, so I wonder whether this will be resolved anyway once I finally get the new release out.
I added an extra `except` statement to skip `PanicException`:

```python
try:
    regions = RegionSet(file)
    tokens = tokenizer.tokenize(regions)
    tokens_file = os.path.join(tokens_dir, f"{Path(file).stem}.gtok")
    write_tokens_to_gtok(tokens_file, tokens)
except Exception as e:
    with open(failed_files, "a") as f:
        f.write(f"{file}\t{e}\n")
except BaseException as be:
    with open(failed_files, "a") as f:
        f.write(f"{file}\t{be}\n")
```
In the output file that collects failed BED files and their exceptions, these are the major exception types, with example files:
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612215_Neutrophil_US_CTCF_ChIPseq_peaks.bed.bz2 Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612211_Neutrophil_US_RAD21_ChIPseq_peaks.bed.bz2 Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612220_Neutrophil_Ecoli_CTCF_ChIPseq_peaks.bed.bz2 Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612223_Neutrophil_US_H3K4me1_ChIPseq_peaks.bed.bz2 Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612229_Neutrophil_PMA_H3K4me3_ChIPseq_peaks.bed.bz2 Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612227_Neutrophil_US_H3K4me3_ChIPseq_peaks.bed.bz2 Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612218_Neutrophil_PMA_CTCF_ChIPseq_peaks.bed.bz2 Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171074/GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.gz Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171074/GSM5218295_PDXHCI005_Dec_pooled_input_peaks.narrowPeak.gz Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171070/GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.gz Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171070/GSM5218295_PDXHCI005_Dec_pooled_input_peaks.narrowPeak.gz Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE155686/GSM4710499_KA61.FCHNKLLBBXX_L8_R1_IGAATTCGT-TAATCTTA.PE_macs2_peaks.bed.gz Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310198_H69_5TGF_beta_3_SMAD3_peaks.bed.gz CSV parse error: Expected 1 columns, got 5: chr1 778244 778741 peak_1 94.30978
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310186_TGF_beta_2_K27ac_peaks.bed.gz CSV parse error: Expected 1 columns, got 5: chr1 10087 10345 peak_1 6.74834
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310211_CPTH6_Vehicle_3_peaks.bed.gz CSV parse error: Expected 1 columns, got 5: chr1 10090 10234 peak_1 11.27605
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310208_CPTH6_TGF_beta_2_peaks.bed.gz CSV parse error: Expected 1 columns, got 5: chr1 10073 10418 peak_1 18.25188
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796503_S18-EBNA2Dox10ug-H3K27ac-Rep1.bed.gz CSV parse error: Expected 1 columns, got 13: #PeakID chr start end strand Normalized Tag Count region size findPeaks Score Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796502_S18-EBNA2Con-H3K27ac-Rep2.bed.gz CSV parse error: Expected 1 columns, got 13: #PeakID chr start end strand Normalized Tag Count region size findPeaks Score Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796504_S18-EBNA2Dox10ug-H3K27ac-Rep2.bed.gz CSV parse error: Expected 1 columns, got 13: #PeakID chr start end strand Normalized Tag Count region size findPeaks Score Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796501_S18-EBNA2Con-H3K27ac-Rep1.bed.gz CSV parse error: Expected 1 columns, got 13: #PeakID chr start end strand Normalized Tag Count region size findPeaks Score Total Tags Control ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796499_S18-Dox10ug-H3K27ac-Rep1.bed.gz CSV parse error: Expected 1 columns, got 13: #PeakID chr start end strand Normalized Tag Count region size findPeaks Score Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796500_S18-Dox10ug-H3K27ac-Rep2.bed.gz CSV parse error: Expected 1 columns, got 13: #PeakID chr start end strand Normalized Tag Count region size findPeaks Score Total Tags Control ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126573/GSM6202204_1155.replicated.broadPeak.gz CSV parse error: Expected 1 columns, got 5: ����70�f��<���y)�]�-[_>�����k��Ѭ�S�Q0�{C�Q���e���춓���-r�|���5>�9
��5|�9��v���?��� v ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126573/GSM6202209_1224.replicated.broadPeak.gz CSV parse error: Expected 1 columns, got 2: �;�=`���?�
/�������7�r����vZ9�U����������G�{�r����S~�̻ڟ����?%���o��a��7Z�_f�5��?y���� ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126573/GSM6202210_1310.replicated.broadPeak.gz CSV parse error: Expected 2 columns, got 1: >]��:s|��r���5�1���6�)ټ�Y��t��i���}|3���|w+3��I��?O=I.�[|��Ϟ���jo[M)�mڿ�U�#0�/�F��d�(�� ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104056_FL_UN_Subtel_10qMulti.bed.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104003_iPS_cR35_+3p_Subtel_10p+18p.bed.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104090_Fibroblast_pG_Subtel_5p.bed.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4103995_iPS_cR35_+44p_Subtel_10p+18p.bed.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104117_cG13-treat_Subtel_5p_OICR.bed.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104011_iPS_cR35_+23p_Subtel_10qMulti.bed.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158348/GSM4798203_k562_cnr_erh_copy2.narrowPeak.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158348/GSM4798202_k562_cnr_erh_copy1.narrowPeak.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798204_k562_cnr_wbp11_copy1.narrowPeak.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798205_k562_cnr_wbp11_copy2.narrowPeak.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798203_k562_cnr_erh_copy2.narrowPeak.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798202_k562_cnr_erh_copy1.narrowPeak.gz called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119516_SF11612_snATAC_peaks.bed.gz 1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119519_SF11215_peaks.bed.gz 1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119515_SF11949_snATAC_peaks.bed.gz 1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119514_SF12017_snATAC_peaks.bed.gz 1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119517_SF11979_snATAC_peaks.bed.gz 1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119520_SF11331_peaks.bed.gz 1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119513_SF11964_snATAC_peaks.bed.gz 1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119518_SF11956_snATAC_peaks.bed.gz 1
> I added an extra except statement to skip `PanicException`:
Yeah, that's fine, just be aware that it can mess with things. If it helps in the short term, that's fine, but hopefully the upcoming changes will help out here.
The issue is that the regions are stored as `chr1:100-200`. I now make sure to check that there are three fields after splitting on `\t`; otherwise we bail and raise an exception.
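A Python sketch of that guard (the real check lives in Rust behind pyo3; `validate_bed_line` is a hypothetical name for illustration):

```python
def validate_bed_line(line: str) -> tuple:
    """Bail with a catchable ValueError instead of panicking.

    A colon-style region like "chr1:100-200" yields a single field
    after splitting on tabs, so it fails the check immediately.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(
            f"BED file line does not have at least 3 fields: {line!r}"
        )
    return fields[0], int(fields[1]), int(fields[2])

# A proper tab-delimited line parses fine:
print(validate_bed_line("chr1\t100\t200"))  # -> ('chr1', 100, 200)
```

Raising `ValueError` (rather than letting Rust panic) keeps the error inside the normal `Exception` hierarchy, so the caller's `except Exception` clause works as expected.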
**TypeError**
The issue with these is that the regions are stored with commas in the starts and ends: `chr10 133,785,955 133,786,173` 🤦🏻♂️ This one also gets fixed with the new bail.
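For reference, comma-grouped coordinates are not valid integers, so parsing the start/end fields fails loudly rather than silently succeeding, and the same bail path turns that into a catchable error:

```python
# int() rejects thousands separators outright, so a line like
# "chr10\t133,785,955\t133,786,173" fails at the coordinate-parsing step.
line = "chr10\t133,785,955\t133,786,173"
fields = line.split("\t")
try:
    start, end = int(fields[1]), int(fields[2])
except ValueError as e:
    print("bail:", e)
```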
**Garbled files?**
These are resolved too.
**CSV parse error**
These contain header lines that mess with parsing. Example: `track name="H69_5T_-_SMAD3_-_20.FCH7Y73BBXY_L8_R1_ITTGGAGGT.PE_macs2_peaks.bed" description="H69_5T_-_SMAD3_-_20.FCH7Y73BBXY_L8_R1_ITTGGAGGT.PE_macs2_peaks.bed"`
Also solved
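BED parsers commonly skip UCSC `track`/`browser` lines and `#` comments before reading records; a minimal sketch of that filter (not the actual Rust implementation):

```python
def data_lines(lines):
    """Yield only record lines, skipping UCSC/BED header lines."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith(("track", "browser", "#")):
            continue  # header or comment, not a region record
        yield line

sample = [
    'track name="peaks" description="example"',
    "#PeakID\tchr\tstart\tend",
    "chr1\t778244\t778741\tpeak_1\t94.30978",
]
print(list(data_lines(sample)))  # only the record line survives
```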
**Empty**

Yeah, seems empty to me... 0B is what I see when running `du -sh`.
**Early end?**
Yeah these seem empty. Solved
I think the conclusion was to not catch these exceptions, so I wouldn't do this :).
Are you talking about Claude's code?

```python
try:
    regions = RegionSet(file)
    tokens = tokenizer.tokenize(regions)
    tokens_file = os.path.join(tokens_dir, f"{Path(file).stem}.gtok")
    write_tokens_to_gtok(tokens_file, tokens)
except Exception as e:
    with open(failed_files, "a") as f:
        f.write(f"{file}\t{e}\n")
except BaseException as be:
    with open(failed_files, "a") as f:
        f.write(f"{file}\t{be}\n")
```
Because on the Rust side I am now bubbling up proper exceptions, which should be catchable.
For example... using the new tokenizers yields this:
```pycon
>>> from geniml.tokenization.main import TreeTokenizer
>>> t = TreeTokenizer.from_pretrained("databio/r2v-luecken2021-hg38-v2")
>>>
>>> try:
...     t("GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.g")
... except Exception as e:
...     print(e)
...
The file GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.g does not exist.
>>> try:
...     t("GSM4310198_H69_5TGF_beta_3_SMAD3_peaks.bed.gz")
... except Exception as e:
...     print(e)
...
BED file line does not have at least 3 fields: track name="..."
```
Which seems to indicate that we caught it correctly instead of panicking and throwing an uncatchable exception.
@ClaudeHu can you confirm if this is solved or not?
While running this code (based on pretokenization code):
This error occurred: