Kuanhao-Chao / splam

✂️ Deep learning-based splice site predictor that improves spliced alignments
http://ccb.jhu.edu/splam/

Splam for non-human species, new genome (no assembly report) #2

Closed estolle closed 11 months ago

estolle commented 1 year ago

Hi there

I have a problem similar to the one in the other issue: a non-human species. It's not an NCBI/RefSeq genome but a newly assembled one, so we do not have an assembly report. It would be extremely convenient for everyone working with non-human or non-model species if splam could use something more generic, say a .fai index.
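For reference, a .fai index (as written by `samtools faidx`) stores each sequence name and its length in the first two tab-separated columns, so a generic length lookup could in principle be built from it. The following is only a minimal sketch of that idea, not part of Splam; the function name `read_fai_lengths` is hypothetical, and the offsets in the example are made up (only the two lengths are taken from this issue).

```python
import io


def read_fai_lengths(handle):
    """Return {sequence_name: length} from an open .fai file handle."""
    lengths = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        # Column 1 is the sequence name, column 2 its length in bases.
        lengths[fields[0]] = int(fields[1])
    return lengths


# Illustrative index content: offsets and line widths are invented,
# the lengths are the ones quoted for scaffold_1 and scaffold_2 in this issue.
example_fai = io.StringIO(
    "scaffold_1\t37766881\t12\t60\t61\n"
    "scaffold_2\t36484387\t38396353\t60\t61\n"
)
fai_lengths = read_fai_lengths(example_fai)
print(fai_lengths["scaffold_1"])
```

With a real index, the `io.StringIO` object would be replaced by `open("genome.fa.fai")`, where the file name is again just a placeholder.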

I already tried to create a fake assembly report but could you specify which parts have to be present and how they are expected to be formatted?

This is the error:

[Info] Chromosomes in the annotation file is in 'NCBI RefSeq' style
Traceback (most recent call last):
  File "/home/ek/virtualenvs/splam/bin/splam", line 8, in <module>
    sys.exit(main())
  File "/home/ek/virtualenvs/splam/lib/python3.8/site-packages/splam/main.py", line 203, in main
    donor_bed, acceptor_bed = parse.create_donor_acceptor_bed(junction_bed, outdir, assembly_report)
  File "/home/ek/virtualenvs/splam/lib/python3.8/site-packages/splam/parse.py", line 83, in create_donor_acceptor_bed
    if donor_e >= chrs[chr] or acceptor_e >= chrs[chr]:
KeyError: 'scaffold_1'

My fake assembly file:

scaffold_1 unplaced-scaffold na na xxxxxx.1 = yyyyyy.1 zzzzz 37766881 chr1
scaffold_2 unplaced-scaffold na na xxxxxx.1 = yyyyyy.1 zzzzz 36484387 chr2
scaffold_3 unplaced-scaffold na na xxxxxx.1 = yyyyyy.1 zzzzz 30962093 chr3
scaffold_4 unplaced-scaffold na na xxxxxx.1 = yyyyyy.1 zzzzz 29900042 chr4

junction.bed file:

scaffold_1 4213 4613 JUNC00000001 16 +
scaffold_1 4683 5687 JUNC00000002 5 +
scaffold_1 5106 5432 JUNC00000003 13 -
scaffold_1 6460 8830 JUNC00000004 12 +

estolle commented 1 year ago

Reformatting the fake assembly file didn't fix the issue:

scaffold_1 assembled-molecule 1 Chromosome XX000001.1 = NC_000001.1 CfrieseiERGA 37766881 chr1
scaffold_2 assembled-molecule 2 Chromosome XX000002.1 = NC_000002.1 CfrieseiERGA 36484387 chr2
scaffold_3 assembled-molecule 3 Chromosome XX000003.1 = NC_000003.1 CfrieseiERGA 30962093 chr3
scaffold_4 assembled-molecule 4 Chromosome XX000004.1 = NC_000004.1 CfrieseiERGA 29900042 chr4

It's still failing, apparently because of unexpected chromosome names in the FASTA/BAM/BED files: "scaffold_1".
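The `KeyError: 'scaffold_1'` in the traceback means that the chromosome-length lookup (`chrs`) has no entry under that name. A quick way to see which names are affected is to compare the first column of the junction BED file against the keys of whatever length table is available. This is a diagnostic sketch only: `missing_chromosomes` is a hypothetical helper, and the example table keyed by `chr1` is an assumption for illustration, not a statement about which assembly-report column Splam actually uses.

```python
def missing_chromosomes(junction_lines, chrom_lengths):
    """Return sorted chromosome names used in BED lines but absent from chrom_lengths."""
    missing = set()
    for line in junction_lines:
        if not line.strip():
            continue  # skip blank lines
        chrom = line.split()[0]  # BED column 1 is the chromosome name
        if chrom not in chrom_lengths:
            missing.add(chrom)
    return sorted(missing)


# Two junctions copied from the junction.bed excerpt in this issue.
junctions = [
    "scaffold_1 4213 4613 JUNC00000001 16 +",
    "scaffold_1 4683 5687 JUNC00000002 5 +",
]
# Assumed for illustration: a table keyed by a different naming style.
lengths_by_other_name = {"chr1": 37766881}
print(missing_chromosomes(junctions, lengths_by_other_name))
```

An empty result would indicate that every junction chromosome has a length entry, so a failure would have to come from somewhere else.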

Any suggestions on how to make splam accept this? Otherwise we cannot use splam for any new genome (and there are a lot coming).

Thanks

Kuanhao-Chao commented 11 months ago

Hi @estolle,

Thanks for raising this issue. I have just released a new version. You can check it out here: https://github.com/Kuanhao-Chao/splam/tree/v1.0.3. Now, there's no need to provide an assembly_report file. Splam directly reads the length of each chromosome from the FASTA file.

Feel free to let us know if you encounter any issues running Splam v1.0.3.

Kuan-Hao
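As an illustration of the approach described for v1.0.3, chromosome lengths can be derived from a FASTA file by summing the sequence-line lengths under each header. The sketch below uses only the standard library and is not Splam's actual code; `fasta_lengths` is a hypothetical name, and the toy sequences are invented.

```python
import io


def fasta_lengths(handle):
    """Return {sequence_name: length} for an open FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the header text up to the first whitespace as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence line: add its length to the current record.
            lengths[name] += len(line)
    return lengths


# Toy input; the second record of scaffold_1 is wrapped over two lines.
toy_fasta = io.StringIO(">scaffold_1 toy record\nACGTAC\nGT\n>scaffold_2\nACG\n")
toy_lengths = fasta_lengths(toy_fasta)
print(toy_lengths)
```

Because the names come straight from the FASTA headers, they match whatever naming the alignments and junction files derived from that FASTA use, such as `scaffold_1`.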