Issue with none type while saving data_dict to h5

NicolasProvencher commented 6 months ago

Hi, as asked, I opened a new issue.

I am using this GTF: Homo_sapiens.GRCh38.109.chr.gtf.gz

The FASTA file was created by concatenating all the Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz files (one for each chromosome) from the FTP https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna. Here is a Google Drive link to it: https://drive.google.com/drive/folders/12wDUc6IcYFmBs-w-e2bTtliKBXJiPBYk?usp=sharing

I had to use this file because, when using Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, weird scaffolds were loaded as chromosomes into the model for training. (I was somehow able to get to the machine learning phase with files aligned on a GTF and DNA containing those scaffolds, which I think were treated as chromosomes.)

The genome_name I was referring to was a typo (oops); I meant gene_name in data.py, line 192. Sorry for the confusion.

I somehow lost the log traceback, and I have parse_ribo_data running right now. I'll edit this post when it's done and I'm able to reproduce the bug.

In the meantime, this is the line that was causing the issue (line 158 of data.py, I think, part of the save_transcriptome_to_h5 function):

    grp.create_dataset(
        key, data=array, dtype=f"<S{max(1,max([len(s) for s in array]))}"
    )

More specifically, the len(s) call fails, since some of the items in array are None for the keys gene_name, tag, and support_lvl (I had to test this part in a notebook to figure it out).

As always, thanks a lot for the help.
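For anyone hitting the same error, here is a minimal sketch of the failure and one possible None-safe workaround. It is not the fix that was applied in the repository; the sample values and the helper names `fixed_width_dtype` and `fill_missing` are made up for illustration, and only the dtype expression is taken from the line quoted above.

```python
# Hypothetical stand-in for one column of the data dict (e.g. gene_name),
# where some entries are None instead of strings.
array = ["BRCA1", None, "TP53"]

# Reproduce the failure: len(None) raises TypeError inside the comprehension.
try:
    max(1, max([len(s) for s in array]))
    raised = False
except TypeError:
    raised = True


def fixed_width_dtype(values):
    """Build the fixed-width bytes dtype string, skipping None entries.

    Hypothetical helper: None contributes no length, and the width is
    never smaller than 1, matching the max(1, ...) in the original line.
    """
    longest = max((len(s) for s in values if s is not None), default=0)
    return f"<S{max(1, longest)}"


def fill_missing(values, placeholder=""):
    """Replace None with a placeholder string so every item can be stored."""
    return [placeholder if s is None else s for s in values]


dtype = fixed_width_dtype(array)
cleaned = fill_missing(array)
```

With this approach, `cleaned` and `dtype` could be passed to `grp.create_dataset` in place of `array` and the original dtype expression. Whether an empty string is an acceptable placeholder for a missing gene_name, tag, or support_lvl depends on how those fields are read back later.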

Nicolas

jdcla commented 6 months ago

Hey Nicolas, Looking into it now. I would recommend simply using the primary assembly file though instead of concatenating individual fasta files. There might be a problem there. The presence of small contigs is not an issue.

jdcla commented 6 months ago

Hey Nicolas, The problem was caused by altered behavior of the updated polars package. It should be fixed now. Thank you for your valuable feedback.

NicolasProvencher commented 6 months ago

Thanks for the quick fix. I will close this issue and test it.

Happy holidays

TRISTAN-ORF / RiboTIE

Issue with none type while saving data_dict to h5 #5