GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License
131 stars 23 forks source link

dataprep error and empty output files #170

Open Michael-m6A opened 1 year ago

Michael-m6A commented 1 year ago

Hi, I keep getting an error message and empty output files (data.index, data.json, data.log, data.readcount eventalign.index) when running the xpore dataprep step for genomic coordinates.

I have downloaded the latest reference fasta and gtf files for Aedes aegypti from ENSEMBL as recommended.

Please see the header of the gtf file, script, and error messages below.

Head of gtf file

head /scratch/project_mnt/S0081/Aedes_aegypti_lvpagwg.AaegL5.55.gtf

!genome-build AaegL5

!genome-version AaegL5

!genome-build-accession GCA_002204515.1

2 VectorBase gene 97401212 97402380 . + . gene_id "AAEL020088"; gene_source "VectorBase"; gene_biotype "protein_coding"; 2 VectorBase transcript 97401212 97402380 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical"; 2 VectorBase exon 97401212 97401577 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E1"; tag "Ensembl_canonical"; 2 VectorBase CDS 97401561 97401577 . + 0 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical"; 2 VectorBase start_codon 97401561 97401563 . + 0 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical"; 2 VectorBase exon 97401632 97401925 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E2"; tag "Ensembl_canonical"; 2 VectorBase CDS 97401632 97401925 . + 1 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";

Script

!/bin/bash

#

PBS -l walltime=80:00:00

PBS -N xpore-dataprep-ENSEMBL-55-genomic-gtf-PNXP22239

PBS -j oe

PBS -A UQ-SCI-BiolSci

PBS -l select=1:ncpus=24:mem=80G

#

-location of scratch project space- dir=/scratch/project/m6a-ml-2022

-location for eventalign file on scratch project space- eventalign=${dir}/PNXP22239-WB-1-ENSEMBL-55-eventalign.txt

-location of ENSEMBL release 55 gtf reference file to work on scratch project space- gtf=${dir}/Aedes_aegypti_lvpagwg.AaegL5.55.gtf

-location of ENSEMBL release 55 cDNA and ncRNA concatenated reference fasta file to work on scratch project space- fasta=${dir}/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.fa

-xpore loading module-

module load xpore/2.1

-xpore command- xpore dataprep \ --eventalign ${eventalign} \ --gtf_or_gff ${gtf} \ --transcript_fasta ${fasta} \ --out_dir ${dir}/dataprep-ENSEMBL-55-genomic-gtf-PNXP22239 \ --genome

Job log error message

/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:72: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

…………………………..

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy chunk_split['line_length'] = np.array(lines) /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() Traceback (most recent call last): File "/sw/QFAB/miniconda3/envs/xpore_2.1/bin/xpore", line 10, in sys.exit(main()) File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/xpore.py", line 67, in main options.func(options) File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 753, in dataprep parallel_preprocess_gene(eventalign_filepath,fasta_dict,annotation_dict,is_gff,out_dir,n_processes,readcount_min,readcount_max,resume) File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 342, in parallel_preprocess_gene n_reads, tx_ids, t2g_mapping = t2g(gene_id,fasta_dict,annotation_dict,g2t_mapping,df_eventalign_index,readcount_min) File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 88, in t2g tx_seq = fasta_dict[tx][0] KeyError: 'AAEL010314-RA' =>> PBS: job killed: walltime 288068 exceeded limit 288000 ########################### Job Execution History ############################# JobId:970386.tinmgr2 UserName: GroupName:qris-uq JobName:xpore-dataprep-ENSEMBL-55-genomic-gtf-PNXP22239 SessionId:35971 ResourcesRequested:mem=80gb,ncpus=24,place=free,walltime=80:00:00 ResourcesUsed:cpupercent=185,cput=06:15:32,mem=2491616kb,ncpus=24,vmem=9301936kb,walltime=80:01:08 QueueUsed:General AccountString:UQ-SCI-BiolSci ExitStatus:271 ###############################################################################

I’m not worried about the performance warning, however, the errors in line 10, 67, 753, 342, and 88 which seem to be the source of the error I don’t understand.

Any guidance would be greatly appreciated.

Many thanks, Michael

yuukiiwa commented 1 year ago

Hi Michael (@Michael-m6A),

Do you mind sharing the download links to the gtf and the fasta files, please? Thanks!

Best wishes, Yuk Kei

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa), Absolutely. Please see the links to the fasta and gtf files below.

Index of /pub/metazoa/release-55/fasta/aedes_aegypti_lvpagwg/cdna http://ftp.ensemblgenomes.org/pub/metazoa/release-55/fasta/aedes_aegypti_lvpagwg/cdna/

Index of /pub/metazoa/release-55/fasta/aedes_aegypti_lvpagwg/ncrna http://ftp.ensemblgenomes.org/pub/metazoa/release-55/fasta/aedes_aegypti_lvpagwg/ncrna/

Please note: I have concatenated the cdna and ncrna fasta file in order to obtain one reference fasta file.

Index of /pub/metazoa/release-55/gtf/aedes_aegypti_lvpagwg http://ftp.ensemblgenomes.org/pub/metazoa/release-55/gtf/aedes_aegypti_lvpagwg/

Thanks, Michael

yuukiiwa commented 1 year ago

Hi Michael (@Michael-m6A),

Here is a python script that will convert your merged fasta file into a format that xpore can process (I tested it on the fasta and gtf above):

file=open('merged.fa','r')
outfile=open('new.fa','w')
for ln in file:
 if ln.startswith('>'):
  ln=ln.split()[0]
  print(ln)
  outfile.write(ln+'\n')
 else:
  outfile.write(ln)
outfile.close()

Thanks!

Best wishes, Yuk Kei

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

Thank you very much for the python script to make my fasta files usable for xpore.

Please see fasta file before and after applying the python script below:

fasta file before

grep ">" /scratch/project_mnt/S0081/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.fa | head -10

AAEL012458-RD cdna chromosome:AaegL5:1:123206413:123240627:1 gene:AAEL012458 gene_biotype:protein_coding transcript_biotype:protein_coding AAEL012458-RC cdna chromosome:AaegL5:1:122993365:123240627:1 gene:AAEL012458 gene_biotype:protein_coding transcript_biotype:protein_coding AAEL012458-RB cdna chromosome:AaegL5:1:122788218:123240627:1 gene:AAEL012458 gene_biotype:protein_coding transcript_biotype:protein_coding AAEL006471-RB cdna chromosome:AaegL5:3:216688037:216701429:-1 gene:AAEL006471 gene_biotype:protein_coding transcript_biotype:protein_coding description:alcohol dehydrogenase AAEL021257-RB cdna chromosome:AaegL5:1:239266392:239268653:1 gene:AAEL021257 gene_biotype:protein_coding transcript_biotype:protein_coding AAEL021257-RA cdna chromosome:AaegL5:1:239266375:239268441:1 gene:AAEL021257 gene_biotype:protein_coding transcript_biotype:protein_coding AAEL020088-RB cdna chromosome:AaegL5:2:97401212:97402380:1 gene:AAEL020088 gene_biotype:protein_coding transcript_biotype:protein_coding AAEL020088-RA cdna chromosome:AaegL5:2:97401216:97402380:1 gene:AAEL020088 gene_biotype:protein_coding transcript_biotype:nontranslating_CDS AAEL008603-RG cdna chromosome:AaegL5:2:362937967:362942272:1 gene:AAEL008603 gene_biotype:protein_coding transcript_biotype:protein_coding AAEL008603-RC cdna chromosome:AaegL5:2:362937967:362942272:1 gene:AAEL008603 gene_biotype:protein_coding transcript_biotype:protein_coding

fasta file after applying python script

grep ">" /scratch/project_mnt/S0081/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.xpore.modified.fa | head -10

AAEL012458-RD AAEL012458-RC AAEL012458-RB AAEL006471-RB AAEL021257-RB AAEL021257-RA AAEL020088-RB AAEL020088-RA AAEL008603-RG AAEL008603-RC

The fasta file looks good and I have set up a new xpore dataprep genomic coordinate run using the xpore modified fasta and original ENSEMBL gtf file.

I will let you know how the run goes.

Thanks again!

Regards, Michael

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

The current xpore dataprep run finished but unfortunately gives a different error in line 419 and it also exceeded 40 hours of walltime. Please see the error message below.

Error message

cat xpore-dataprep-ENSEMBL-55-genomic-gtf-xpore-modified-PNXP22239.o985129 ########################### Execution Started ############################# JobId:985129.tinmgr2 UserName:user GroupName:qris-uq ExecutionHost:tn329b ############################################################################### /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum()

A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy chunk_split['line_length'] = np.array(lines) /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum()

Process Consumer-2: Traceback (most recent call last): File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/helper.py", line 113, in run result = self.task_function(*next_task_args,self.locks) *File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 419, in preprocess_gene genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'** KeyError: ('AAEL023294-RA', 6259) =>> PBS: job killed: walltime 144007 exceeded limit 144000 ########################### Job Execution History ############################# JobId:985129.tinmgr2 UserName:user GroupName:qris-uq JobName:xpore-dataprep-ENSEMBL-55-genomic-gtf-xpore-modified-PNXP22239 SessionId:45949 ResourcesRequested:mem=80gb,ncpus=24,place=free,walltime=40:00:00 ResourcesUsed:cpupercent=185,cput=06:48:32,mem=2565768kb,ncpus=24,vmem=9406104kb,walltime=40:00:07 QueueUsed:General AccountString:UQ-SCI-BiolSci ExitStatus:271 ###############################################################################

Many thanks again.

Regards, Michael

yuukiiwa commented 1 year ago

Hi Michael (@Michael-m6A),

Do you mind trying to strip off the -XX in AAEL000000-XX (the -RA in AAEL023294-RA) in the first column of the eventalign.txt file using the script below (not tested)?

file=open('eventalign.txt','r')
outfile=open('modified_eventalign.txt','w')
outfile.write(file.readline())
for ln in file:
 ln=ln.split('\t')
 newln=[ln[0].split('-')[0]]+ln[1:]
 outfile.write('\t'.join(newln))
outfile.close()

Thanks!

Best wishes, Yuk Kei

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

Thanks for your suggestion, but I am a bit reluctant to delete first column of the eventalign.txt -XX in AAEL000000-XX (the -RA in AAEL023294-RA).

The reason being is that the -RA, -RB etc. are the identifier for the different transcripts/isoforms of a gene. My thinking is that if I delete the -XX in AAEL000000-XX of the eventalign.txt file, then I would need to delete the-XX of the fasta and gtf file as well. However, in my view that would mean I am deleting all the transcripts/isoforms identifiers.

Can you think of another possible solution for the error in line 419 of the xpore python script?

Could it help applying the python script you previously provided to clean up the fasta file and do the same with the gtf file? Please see head of gtf file below.

head /scratch/project_mnt/S0081/Aedes_aegypti_lvpagwg.AaegL5.55.gtf

!genome-build AaegL5

!genome-version AaegL5

!genome-build-accession GCA_002204515.1

2 VectorBase gene 97401212 97402380 . + . gene_id "AAEL020088"; gene_source "VectorBase"; gene_biotype "protein_coding"; 2 VectorBase transcript 97401212 97402380 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical"; 2 VectorBase exon 97401212 97401577 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E1"; tag "Ensembl_canonical"; 2 VectorBase CDS 97401561 97401577 . + 0 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical"; 2 VectorBase start_codon 97401561 97401563 . + 0 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical"; 2 VectorBase exon 97401632 97401925 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E2"; tag "Ensembl_canonical"; 2 VectorBase CDS 97401632 97401925 . + 1 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";

Thanks again! Michael

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

I have been going through all my scripts from the beginning and just realised that I obviously used the unmodified reference fasta file to align the basecalled fastq reads using minimap2, and then Nanopolish to generate the eventalign files.

Would I be correct in saying that I need to repeat the minimap2 and Nanopolish steps using the xpore_modified reference fasta file.

Best wishes, Michael

yuukiiwa commented 1 year ago

Hi Michael (@Michael-m6A),

You don't have to rerun your samples from the beginning using the newly modified reference fasta file.

Regarding not stripping the -XX in AAEL000000-XX: I don't think the transcript isoform is preserved when you are using the --genome flag with xpore as all transcripts of the same gene are being collapsed to the gene level, where the coordinates would be the genome coordinate instead of the transcriptome coordinate. You will not have to delete the -XX in AAEL000000-XX in the fasta and gtf files.

Thanks!

Best wishes, Yuk Kei

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

I am very much appreciating you taking the time to explain and troubleshoot this fasta and eventalign file compatibility issue with me. Your explanation totally makes sense to me and we’ve applied the python script you provided to delete the (-XX in AAEL000000-XX) in the eventalign file. Please see the header of the xpore-modified and unmodified eventalign file below.

Job log ########################### Execution Started ############################# JobId:1040343.tinmgr2 UserName: user GroupName:xxx ExecutionHost:tn119a ############################################################################### 6092416079 /scratch/project_mnt/S0081/PNXP22239-WB-1-ENSEMBL-55-xpore-modified-eventalign.txt 6092416079 /scratch/project_mnt/S0081/PNXP22239-WB-1-ENSEMBL-55-eventalign.txt ########################### Job Execution History ############################# JobId:1040343.tinmgr2 UserName:user GroupName:xx JobName:Python-xpore-modify-eventalign-file SessionId:7014 ResourcesRequested:mem=80gb,ncpus=2,place=free,walltime=12:00:00 ResourcesUsed:cpupercent=98,cput=03:56:54,mem=26952kb,ncpus=2,vmem=386740kb,walltime=04:08:10 QueueUsed:General AccountString:xx ExitStatus:0

head /scratch/project_mnt/S0081/PNXP22239-WB-1-ENSEMBL-55-eventalign.txt contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx AAEL006471-RB 90 ATAAT 6 t 530 84.32 2.712 0.00700 ATAAT 89.36 3.04 -1.46 66606 66627 AAEL006471-RB 90 ATAAT 6 t 531 87.57 1.950 0.01500 ATAAT 89.36 3.04 -0.52 66561 66606 AAEL006471-RB 91 TAATC 6 t 532 108.49 2.862 0.00433 TAATC 106.55 3.11 0.55 66548 66561 AAEL006471-RB 91 TAATC 6 t 533 111.79 0.902 0.00233 TAATC 106.55 3.11 1.48 66541 66548 AAEL006471-RB 91 TAATC 6 t 534 109.40 1.550 0.00233 TAATC 106.55 3.11 0.81 66534 66541 AAEL006471-RB 91 TAATC 6 t 535 109.79 2.267 0.00400 TAATC 106.55 3.11 0.92 66522 66534 AAEL006471-RB 92 AATCA 6 t 536 99.10 2.201 0.00167 AATCA 100.24 5.70 -0.18 66517 66522 AAEL006471-RB 92 AATCA 6 t 537 90.56 2.237 0.00267 AATCA 100.24 5.70 -1.49 66509 66517 AAEL006471-RB 92 AATCA 6 t 538 94.81 3.837 0.00667 AATCA 100.24 5.70 -0.84 66489 66509

head /scratch/project_mnt/S0081/PNXP22239-WB-1-ENSEMBL-55-xpore-modified-eventalign.txt contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx AAEL006471 90 ATAAT 6 t 530 84.32 2.712 0.00700 ATAAT 89.36 3.04 -1.46 66606 66627 AAEL006471 90 ATAAT 6 t 531 87.57 1.950 0.01500 ATAAT 89.36 3.04 -0.52 66561 66606 AAEL006471 91 TAATC 6 t 532 108.49 2.862 0.00433 TAATC 106.55 3.11 0.55 66548 66561 AAEL006471 91 TAATC 6 t 533 111.79 0.902 0.00233 TAATC 106.55 3.11 1.48 66541 66548 AAEL006471 91 TAATC 6 t 534 109.40 1.550 0.00233 TAATC 106.55 3.11 0.81 66534 66541 AAEL006471 91 TAATC 6 t 535 109.79 2.267 0.00400 TAATC 106.55 3.11 0.92 66522 66534 AAEL006471 92 AATCA 6 t 536 99.10 2.201 0.00167 AATCA 100.24 5.70 -0.18 66517 66522 AAEL006471 92 AATCA 6 t 537 90.56 2.237 0.00267 AATCA 100.24 5.70 -1.49 66509 66517 AAEL006471 92 AATCA 6 t 538 94.81 3.837 0.00667 AATCA 100.24 5.70 -0.84 66489 66509

I have also just submitted a new xpore dataprep run using the xpore-modified fasta and eventalign file. I should have an answer by Monday and will update you on the outcome.

Have a good weekend.

Cheers, Michael

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

The xpore-dataprep run finished without giving me an error message, only listed the usual performance warnings. Please see job log below.

cat xpore-dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239.o1042590 ########################### Execution Started ############################# JobId:1042590.tinmgr2 UserName:xx GroupName:qris-uq ExecutionHost:tn311b ############################################################################### /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum()

/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy chunk_split['line_length'] = np.array(lines) /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() ########################### Job Execution History ############################# JobId:1042590.tinmgr2 UserName:xx GroupName:qris-uq JobName:xpore-dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239 SessionId:39824 ResourcesRequested:mem=100gb,ncpus=24,place=free,walltime=40:00:00 ResourcesUsed:cpupercent=182,cput=06:21:30,mem=1766540kb,ncpus=24,vmem=8564532kb,walltime=03:55:07 QueueUsed:General AccountString: xx ExitStatus:0 ###############################################################################

However, I've continued checking the outcome files of which the data.index data.json data.log data.readcount files are empty and only the eventalign.index contains the information of transcript_id,read_index,pos_start,pos_end, see below.

ls /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239 data.index data.json data.log data.readcount eventalign.index

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.index idx,start,end

more /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.index idx,start,end

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.index idx,start,end

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.json

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.json

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.log Total 0 genes. --- SUCCESSFULLY FINISHED ---

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.log Total 0 genes. --- SUCCESSFULLY FINISHED ---

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.readcount
idx,n_reads

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.readcount
idx,n_reads

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/eventalign.index transcript_id,read_index,pos_start,pos_end AAEL006471,6,172,227770 AAEL006471,10,227770,470953 AAEL006471,15,470953,734427 AAEL006471,14,734427,1002147 AAEL012458,0,1002147,1108621 AAEL006471,9,1108621,1386556 AAEL006471,5,1386556,1664122 AAEL006471,7,1664122,1940105 AAEL006471,8,1940105,2211471

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/eventalign.index AAEL023197,2964358,552753283315,552753335362 AAEL027615,2964371,552753335362,552753396536 AAEL023197,2964355,552753396536,552753454995 AAEL027615,2964370,552753454995,552753516754 AAEL027615,2964372,552753516754,552753636015 AAEL027615,2964374,552753636015,552753788122 AAEL027615,2964373,552753788122,552753936615 AAEL027615,2964367,552753936615,552754139519 AAEL027615,2964369,552754139519,552754367132 AAEL027615,2964375,552754367132,552754580657

It looks like that the xpore dataprep step worked but data.index data.json data.log data.readcount files being empty makes me a bit suspicious.

Can you please tell me what you think and if this is still an issue point me to a possible solution.

Many thanks again!

Cheers, Michael

yuukiiwa commented 1 year ago

Hi Michael (@Michael-m6A),

Do you mind using the following GTF to run xpore dataprep? https://drive.google.com/file/d/1KLikdHKevy7sDWnppo2LmazeehTrKCUy/view?usp=sharing Thanks!

Best wishes, Yuk Kei

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

Absolutely, happy to do so. I have submitted a new xpore dataprep run using your modified gtf file. I am currently in a queue on our HPC waiting for my job submission to run. We should have an outcome by tomorrow which I’ll post again.

Thank you!

Cheers, Michael

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

I am afraid but it looks like using the modified gft file you provided did not solved the issue. The xpore dataprep run using the modified gtf file produced the following error messages (line 10, 67, 753, 342, and 88) and the output files are empty. Please see below.

Job log

cat xpore-dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239.o1055583 ########################### Execution Started ############################# JobId:1055583.tinmgr2 UserName:xx GroupName:qris-uq ExecutionHost:tn425b ############################################################################### /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum()

/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:72: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy chunk_split['line_length'] = np.array(lines) /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum()

/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() Traceback (most recent call last): File "/sw/QFAB/miniconda3/envs/xpore_2.1/bin/xpore", line 10, in sys.exit(main()) File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/xpore.py", line 67, in main options.func(options) File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 753, in dataprep parallel_preprocess_gene(eventalign_filepath,fasta_dict,annotation_dict,is_gff,out_dir,n_processes,readcount_min,readcount_max,resume) File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 342, in parallel_preprocess_gene n_reads, tx_ids, t2g_mapping = t2g(gene_id,fasta_dict,annotation_dict,g2t_mapping,df_eventalign_index,readcount_min) File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 88, in t2g tx_seq = fasta_dict[tx][0] KeyError: 'AAEL014640' =>> PBS: job killed: walltime 72045 exceeded limit 72000 ########################### Job Execution History ############################# JobId:1055583.tinmgr2 UserName:xx GroupName:qris-uq JobName:xpore-dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239 SessionId:24225 ResourcesRequested:mem=100gb,ncpus=24,place=free,walltime=20:00:00 ResourcesUsed:cpupercent=185,cput=06:15:28,mem=2215896kb,ncpus=24,vmem=9023764kb,walltime=20:00:45 QueueUsed:General AccountString:UQ-SCI-BiolSci ExitStatus:271 ###############################################################################

Inspected output files

ls /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239 data.index data.json data.log data.readcount eventalign.index

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239/data.index idx,start,end

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239/data.index idx,start,end

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239/data.json
nothing

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239/data.log
nothing

Usually here we see Total xx genes. --- SUCCESSFULLY FINISHED ---

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239/eventalign.index transcript_id,read_index,pos_start,pos_end AAEL006471,6,172,227770 AAEL006471,10,227770,470953 AAEL006471,15,470953,734427 AAEL006471,14,734427,1002147 AAEL012458,0,1002147,1108621 AAEL006471,9,1108621,1386556 AAEL006471,5,1386556,1664122 AAEL006471,7,1664122,1940105 AAEL006471,8,1940105,2211471

Thank you again.

Cheers, Michael

yuukiiwa commented 1 year ago

Hi Michael (@Michael-m6A),

I have tried running xpore dataprep with the 10-line eventalign.txt you provided above and the following modified fasta and gtf. Here are the top 10 lines of the files:

eventalign.txt

contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx
AAEL006471  90  ATAAT   6   t   530 84.32   2.712   0.00700 ATAAT   89.36   3.04    -1.46   66606   66627
AAEL006471  90  ATAAT   6   t   531 87.57   1.950   0.01500 ATAAT   89.36   3.04    -0.52   66561   66606
AAEL006471  91  TAATC   6   t   532 108.49  2.862   0.00433 TAATC   106.55  3.11    0.5566548   66561
AAEL006471  91  TAATC   6   t   533 111.79  0.902   0.00233 TAATC   106.55  3.11    1.4866541   66548
AAEL006471  91  TAATC   6   t   534 109.40  1.550   0.00233 TAATC   106.55  3.11    0.8166534   66541
AAEL006471  91  TAATC   6   t   535 109.79  2.267   0.00400 TAATC   106.55  3.11    0.9266522   66534
AAEL006471  92  AATCA   6   t   536 99.10   2.201   0.00167 AATCA   100.24  5.70    -0.18   66517   66522
AAEL006471  92  AATCA   6   t   537 90.56   2.237   0.00267 AATCA   100.24  5.70    -1.49   66509   66517
AAEL006471  92  AATCA   6   t   538 94.81   3.837   0.00667 AATCA   100.24  5.70    -0.84   66489   66509

fasta

>AAEL012458
ATTTGAACATTATTGCCGTATAGCAAAATCGTGAAAAAAGTTCAAATTAGTTTTAAAATA
TGTGATGCAATCAGTGCCCGTACAAACACTATGGTGAAATAAAGCTGCGCATTTTACTCC
AACGTTCCTCAAGTGCCAAATATTCTGAACAACTTCTCCATCAATTATGCGAATCCTGTT
CCATAACGAAAACTATGCCAACTACGAAGACTCAGGTGTGGCCACAAGTGCCGATGCCCA
TTCTCTGGTATTCGACGAACCGAACACACCGCCACCTCCCGGCCAAGGAAAATGTCCATC
GACTGGGGGGCCAATTGTGCTGGCAGCTGCCACGGCCATCGGTCCCATCACTGCGACCGG
AAATGGTACCGTCACTGGGTCAATGTCAACCCAGCAACCCTCCTCCGGAACAGGCCCAAT
TCCTTCAACCGGGGTTACTCAAAGGGCGTCGAGCCGTATGTCCAATACCAGCACCAGTAC
GATCAAGCAACATTACTACCCGGAGGGTGGATGGGGCTGGATCGTGATTGTGGTGGGCAT

gtf

2   VectorBase  gene    97401212    97402380    .   +   .   gene_id "AAEL020088"; gene_source "VectorBase"; gene_biotype "protein_coding";
2   VectorBase  transcript  97401212    97402380    .   +   .   gene_id "AAEL020088"; transcript_id "AAEL020088"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
2   VectorBase  exon    97401212    97401577    .   +   .   gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E1"; tag "Ensembl_canonical";
2   VectorBase  CDS 97401561    97401577    .   +   0   gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";
2   VectorBase  start_codon 97401561    97401563    .   +   0   gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
2   VectorBase  exon    97401632    97401925    .   +   .   gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E2"; tag "Ensembl_canonical";
2   VectorBase  CDS 97401632    97401925    .   +   1   gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";
2   VectorBase  exon    97401984    97402380    .   +   .   gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "3"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E3"; tag "Ensembl_canonical";
2   VectorBase  CDS 97401984    97402101    .   +   1   gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "3"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";
2   VectorBase  stop_codon  97402102    97402104    .   +   0   gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "3"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";

My data.json output is not empty, which means that in theory your data.json should not be empty: Screenshot from 2023-01-19 14-08-39

Thanks!

Best wishes, Yuk Kei

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

Thanks very much for running xpore dataprep with the 10-line eventalign.txt file.

I have kind of good news with one remaining barrier to overcome (error in line 419), please see below.

Fasta file Firstly, I have inspected all files (fasta, gtf, and eventalign) used in my previous xpore dataprep run using genomic coordinates and discovered that the original python script previously provided for the fasta file only removed certain header information but did not remove the transcript identification (-RA).

head /scratch/project_mnt/xx/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.xpore.modified.fa

AAEL012458-RD ATTTGAACATTATTGCCGTATAGCAAAATCGTGAAAAAAGTTCAAATTAGTTTTAAAATA TGTGATGCAATCAGTGCCCGTACAAACACTATGGTGAAATAAAGCTGCGCATTTTACTCC AACGTTCCTCAAGTGCCAAATATTCTGAACAACTTCTCCATCAATTATGCGAATCCTGTT CCATAACGAAAACTATGCCAACTACGAAGACTCAGGTGTGGCCACAAGTGCCGATGCCCA TTCTCTGGTATTCGACGAACCGAACACACCGCCACCTCCCGGCCAAGGAAAATGTCCATC GACTGGGGGGCCAATTGTGCTGGCAGCTGCCACGGCCATCGGTCCCATCACTGCGACCGG AAATGGTACCGTCACTGGGTCAATGTCAACCCAGCAACCCTCCTCCGGAACAGGCCCAAT TCCTTCAACCGGGGTTACTCAAAGGGCGTCGAGCCGTATGTCCAATACCAGCACCAGTAC GATCAAGCAACATTACTACCCGGAGGGTGGATGGGGCTGGATCGTGATTGTGGTGGGCAT

We modified your python script and removed the -RA, see below.

file=open('/scratch/project_mnt/xx/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.fa','r') outfile=open('/scratch/project_mnt/xx/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.xpore.modified.fa','w') for ln in file: if ln.startswith('>'): ln=ln.split()[0] ln=ln.split('-')[0]

print(ln)

outfile.write(ln+'\n')

else: outfile.write(ln) outfile.close()

xpore fully modified fasta file

head /scratch/project_mnt/xx/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.xpore.modified.fa

AAEL012458 ATTTGAACATTATTGCCGTATAGCAAAATCGTGAAAAAAGTTCAAATTAGTTTTAAAATA TGTGATGCAATCAGTGCCCGTACAAACACTATGGTGAAATAAAGCTGCGCATTTTACTCC AACGTTCCTCAAGTGCCAAATATTCTGAACAACTTCTCCATCAATTATGCGAATCCTGTT CCATAACGAAAACTATGCCAACTACGAAGACTCAGGTGTGGCCACAAGTGCCGATGCCCA TTCTCTGGTATTCGACGAACCGAACACACCGCCACCTCCCGGCCAAGGAAAATGTCCATC GACTGGGGGGCCAATTGTGCTGGCAGCTGCCACGGCCATCGGTCCCATCACTGCGACCGG AAATGGTACCGTCACTGGGTCAATGTCAACCCAGCAACCCTCCTCCGGAACAGGCCCAAT TCCTTCAACCGGGGTTACTCAAAGGGCGTCGAGCCGTATGTCCAATACCAGCACCAGTAC GATCAAGCAACATTACTACCCGGAGGGTGGATGGGGCTGGATCGTGATTGTGGTGGGCAT

Xpore dataprep run – genomic coordinates using the fully modified fasta, gtf, and eventalign files

Now to the kind of good news, following the xpore-dataprep run using the fully xpore-modified fasta, gtf, and eventalign files all the outcome files contain expected data, see below.

However, the xpore-dataprep run still gets stuck at around 6 hours and in the job, log gave me this error message, also see below.

ls /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239 data.index data.json data.log data.readcount eventalign.index

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.index idx,start,end AAEL019853,0,84036 AAEL008063,84036,792134 AAEL001394,792134,3104813 AAEL018094,3104813,3124935 AAEL004980,3124935,4075246 AAEL027344,4075246,4079148 AAEL010082,4079148,4125980 AAEL012245,4125980,4159251 AAEL025417,4159251,4168695

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.json {"AAEL019853":{"64326275":{"AGAAA":[87.0]},"64326276":{"CAGAA":[111.1]},"64326277":{"TCAGA":[114.1]},"64326278":{"ATCAG":[80.9]},"64326279":{"TATCA":[93.4]},"64326280":{"GTATC":[84.0]},"64326281":{"AGTAT":[119.0]},"64326282":{"GAGTA":[115.2]},"64326283":{"TGAGT":[119.2]},"64326284":{"TTGAG":[95.2]},"64326285":{"ATTGA":[92.0]},"64326288":{"AAGAT":[122.5]},"64326289":{"GAAGA":[114.1]},"64326290":{"AGAAG":[135.5]},"64326291":{"AAGAA":[120.0]},"64326292":{"AAAGA":[109.1]},"64326293":{"TAAAG":[108.6]},"64326294":{"ATAAA":[90.9]},"64326295":{"AATAA":[98.8]},"64326296":{"GAATA":[115.7]},"64326297":{"AGAAT":[134.7]},"64326298":{"AAGAA":[124.7]},"64326299":{"TAAGA":[119.4]},"64326300":{"ATAAG":[78.9]},"64326301":{"TATAA":[80.0]},"64326302":{"GTATA":[92.0]},"64326303":{"GGTAT":[90.0]},"64326304":{"TGGTA":[121.6]},"64326305":{"ATGGT":[97.0]},"64326306":{"AATGG":[85.9]},"64326307":{"GAATG":[104.4]},"64326308":{"GGAAT":[124.7]},"64326309":{"TGGAA":[112.8]},"64326310":{"TTGGA":[96.9]},"64326311":{"CTTGG":[92.7]},"64326312":{"TCTTG":[80.1]},"64326313":{"TTCTT":[78.2]},"64326314":{"TTTCT":[80.0]},"64326315":{"TTTTC":[76.5]},"64326316":{"GTTTT":[76.9]},"64326317":{"GGTTT":[92.7]},"64326318":{"AGGTT":[124.7]},"64326319":{"AAGGT":[120.7]},"64326320":{"AAAGG":[109.1]},"64326321":{"GAAAG":[100.7]},"64326322":{"TGAAA":[111.4]},"64326323":{"ATGAA":[90.4]},"64326324":{"GATGA":[78.1]},"64326325":{"GGATG":[130.5]},"64326326":{"TGGAT":[128.8]},"64326327":{"ATGGA":[104.3]},"64326328":{"GATGG":[84.7]},"64326329":{"GGATG":[118.2]},"64326330":{"TGGAT":[128.5]},"64326331":{"GTGGA":[100.1]},"64326332":{"GGTGG":[120.0]},"64326333":{"GGGTG":[128.1]},"64326334":{"TGGGT":[115.8]},"64326335":{"CTGGG":[97.7]},"64326336":{"GCTGG":[89.4]},"64326337":{"TGCTG":[99.8]},"64326338":{"TTGCT":[98.5]},"64326339":{"GTTGC":[85.4]},"64326340":{"CGTTG":[93.0]},"64326341":{"GCGTT":[84.0]},"64326342":{"TGCGT":[94.9]},"64326343":{"ATGCG":[87.3]},"64326344":{"AATGC":[76.7]},"64326345":{"TAATG":[93.4]},"64326346":{"ATAAT":[89.3]},"64326347":{"AATAA":[94.4]},"64326348":{"GAATA":[112.9]},"6432

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.log AAEL019853: True AAEL008063: True AAEL001394: True AAEL018094: True AAEL004980: True AAEL027344: True AAEL010082: True AAEL012245: True AAEL025417: True AAEL004472: True

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.readcount idx,n_reads AAEL019853,6 AAEL008063,63 AAEL001394,192 AAEL018094,1 AAEL004980,100 AAEL027344,3 AAEL010082,2 AAEL012245,2 AAEL025417,1

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/eventalign.index transcript_id,read_index,pos_start,pos_end AAEL006471,6,172,227770 AAEL006471,10,227770,470953 AAEL006471,15,470953,734427 AAEL006471,14,734427,1002147 AAEL012458,0,1002147,1108621 AAEL006471,9,1108621,1386556 AAEL006471,5,1386556,1664122 AAEL006471,7,1664122,1940105 AAEL006471,8,1940105,2211471

Running time when xpore dataprep job got stuck

Job id Name User Time Use S Queue


1067207.tinmgr2 xpore-dataprep- xx 06:27:24 R General

Job log cat xpore-dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239.o1067207 ########################### Execution Started ############################# JobId:1067207.tinmgr2 UserName:xx GroupName:qris-uq ExecutionHost:tn313c ############################################################################### /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum()

/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy chunk_split['line_length'] = np.array(lines) /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() Process Consumer-2: Traceback (most recent call last): File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/helper.py", line 113, in run result = self.task_function(*next_task_args,self.locks) *File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 419, in preprocess_gene genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' KeyError: ('AAEL024786', 2984) =>> PBS: job killed: walltime 72066 exceeded limit 72000** ########################### Job Execution History ############################# JobId:1067207.tinmgr2 UserName:xx GroupName:qris-uq JobName:xpore-dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239 SessionId:20878 ResourcesRequested:mem=80gb,ncpus=24,place=free,walltime=20:00:00 ResourcesUsed:cpupercent=184,cput=06:27:24,mem=2496020kb,ncpus=24,vmem=9283068kb,walltime=20:01:06 QueueUsed:General AccountString:UQ-SCI-BiolSci ExitStatus:271 ###############################################################################

It appears that there is still an issue with the genomic coordinate conversion line 419, in preprocess_gene genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' “ which I don’t quite understand as we have removed all the transcript identification data (i.e. -RA).

Given that it specifically listed KeyError: ('AAEL024786', 2984), I have inspected the modified gtf and eventalign.txt files.

grep AAEL024786 /scratch/project_mnt/xx/xpore_modified_Aedes_aegypti_lvpagwg.AaegL5.55.gtf 1 VectorBase gene 86105077 86118053 . - . gene_id "AAEL024786"; gene_source "VectorBase"; gene_biotype "protein_coding"; 1 VectorBase transcript 86105077 86118053 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical"; 1 VectorBase exon 86117901 86118053 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL024786-RA-E1"; tag "Ensembl_canonical"; 1 VectorBase CDS 86117901 86118051 . - 0 gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL024786-PA"; tag "Ensembl_canonical"; 1 VectorBase exon 86117589 86117828 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL024786-RA-E2"; tag "Ensembl_canonical"; 1 VectorBase CDS 86117589 86117828 . - 2 gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL024786-PA"; tag "Ensembl_canonical"; 1 VectorBase exon 86117354 86117511 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "3"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL024786-RA-E3"; tag "Ensembl_canonical"; 1 VectorBase CDS 86117354 86117511 . - 2 gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "3"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL024786-PA"; tag "Ensembl_canonical"; 1 VectorBase exon 86105077 86107509 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "4"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL024786-RA-E4"; tag "Ensembl_canonical"; 1 VectorBase CDS 86107264 86107509 . - 0 gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "4"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL024786-PA"; tag "Ensembl_canonical"; 1 VectorBase stop_codon 86107261 86107263 . - 0 gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "4"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical"; 1 VectorBase five_prime_utr 86118052 86118053 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical"; 1 VectorBase three_prime_utr 86105077 86107260 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";

Contains AAEL024786

Data log

grep AAEL024786 /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.log

wc -l /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.log 133 /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.log

modified-eventalign – search for AAEL024786 (mentioned in error message)

grep AAEL024786 /scratch/project_mnt/xx/PNXP22239-WB-1-ENSEMBL-55-xpore-modified-eventalign.txt

AAEL024786 304 TCGGT 1708964 t 9 96.18 3.014 0.00500 TCGGT 95.29 4.98 0.16 52031 52046 AAEL024786 304 TCGGT 1708964 t 10 98.23 1.941 0.00233 TCGGT 95.29 4.98 0.53 52024 52031 AAEL024786 304 TCGGT 1708964 t 11 94.83 3.748 0.00433 TCGGT 95.29 4.98 -0.08 52011 52024 AAEL024786 305 CGGTC 1708964 t 12 118.01 5.639 0.01100 CGGTC 117.18 3.37 0.22 51978 52011 AAEL024786 305 CGGTC 1708964 t 13 119.51 8.057 0.02167 CGGTC 117.18 3.37 0.62 51913 51978 ………….

AAEL024786 3900 GTATA 1709033 t 1308 86.73 1.652 0.00700 GTATA 89.66 2.63 -0.87 2336 2357 AAEL024786 3901 TATAA 1709033 t 1309 83.04 1.997 0.00467 TATAA 92.84 6.28 -1.22 2322 2336 AAEL024786 3902 ATAAT 1709033 t 1310 82.56 2.438 0.00633 ATAAT 89.36 3.04 -1.75 2303 2322 AAEL024786 3902 ATAAT 1709033 t 1311 87.65 1.897 0.00967 ATAAT 89.36 3.04 -0.44 2274 2303 AAEL024786 3902 ATAAT 1709033 t 1312 86.09 2.294 0.02900 ATAAT 89.36 3.04 -0.84 2187 2274 AAEL024786 3902 ATAAT 1709033 t 1313 72.46 2.759 0.00933 NNNNN 0.00 0.00 inf 2159 2187 AAEL024786 3913 ATTAT 1709033 t 1314 63.95 1.483 0.00500 ATTAT 85.44 2.46 -6.83 2144 2159

Contains AAEL024786, however, data.log shows that it only progressed until 133 and then produced the error message.

I am very appreciative of all your hep and time invested in resolving the xpore dataprep genomic coordinate issue, and we’re getting close. Please have a look at the error message and my attempted file investigation, I am hopeful that the error message in line 419 is resolvable.

Thanks again!

Cheers, Michael

yuukiiwa commented 1 year ago

Hi Michael (@Michael-m6A),

Sorry for the delayed reply! (It was Chinese New Year out here)

The transcript_id AAEL024786 is there but position 2984 is not found in AAEL024786, as the annotation dictionary shows that the last nucleotide of the transcript is at position 2983:

{'chr': '1', 'g_id': 'AAEL024786', 'strand': '-', 'transcript': (86105077, 86118053),
 'exon': [(86117901, 86118053), (86117589, 86117828), (86117354, 86117511), (86105077, 86107509)], 
 'tx_exon': [(0, 152), (153, 392), (393, 550), (551, 2983)]}

I think this is more of a 0-base vs 1-base problem, which doesn't quite exist in widely used genomes (e.g. human, mouse, and Arabidopsis) from the same reference source.

I have edited the function in the michael branch, and you can try installing and running it to see whether it solves your problem:

git clone https://github.com/GoekeLab/xpore.git
git checkout origin/michael
sudo python3 setup.py install

The annotation for AAEL024786 should include position 2984:

{'chr': '1', 'g_id': 'AAEL024786', 'strand': '-', 'transcript': (86105077, 86118053), 
 'exon': [(86117901, 86118053), (86117589, 86117828), (86117354, 86117511), (86105077, 86107509)], 
 'tx_exon': [(1, 153), (154, 393), (394, 551), (552, 2984)]}

Thanks!

Best wishes, Yuk Kei

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

No problem at all! My apologies for not responding to you for the last 3 weeks. I was on leave as we welcomed a new baby girl into our family.

I am incredible thankful to you for providing me with this edited function (michael branch).

I have requested this edited function to be installed on our institutional HPC server, unfortunately, that is not the fastest process.

I will keep you up to date as soon as I have any developments.

Thanks again, Michael

yuukiiwa commented 1 year ago

Hi Michael (@Michael-m6A),

Congratulations!! Take your time!

Best wishes, Yuk Kei

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa), Just want to give you a quick update. I am experiencing more delays in being able to test your xPore edited function (michael branch) as our organisation permanently decommissioned the HPC infrastructure/server I was using. The migration to the new HPC environment is underway but none of the tools/modules I need to use are installed yet. I am working on getting everything I required for the m6A analysis to be installed.

I will post here again once I have made actual progress with the analysis using the xPore edited function (michael branch) you kindly provided.

Thanks, Michael

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

We could finally install your modified xpore version on the new HPC infrastructure.

I have used fasta, eventalign, and gtf (containing cDNA and ncRNA information) files without (-RA) and submitted the xpore-developer dataprep job.

However, it did produce a similar error message like we got previously, with the difference that it is now the position leading to the error in the gene 'AAEL024786' is 2985 instead of previously 2984, please see below.

Error message modified xpore developer

File "/home/s4303883/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py", line 419, in preprocess_gene genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' KeyError: ('AAEL024786', 2985)

Error message standard xpore *File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 419, in preprocess_gene genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' KeyError: ('AAEL024786', 2984) =>> PBS: job killed: walltime 72066 exceeded limit 72000

Error message modified xpore developer in detail

-rw-r--r--. 1 xx qris-uq 1433120 Apr 16 04:34 s3535379_job.xpore-dev-dataprep-ENSEMBL-55-genomic-gtf.error -rw-r--r--. 1 xx qris-uq 0 Apr 14 12:04 s3535379_job.xpore-dev-dataprep-ENSEMBL-55-genomic-gtf.out

cat xx_job.xpore-dev-dataprep-ENSEMBL-55-genomic-gtf.error

[xx@bunya3 ~]$ cat xx_job.xpore-dev-dataprep-ENSEMBL-55-genomic-gtf.error /home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum()

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy chunk_split['line_length'] = np.array(lines) /home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() Process Consumer-2: Traceback (most recent call last): File "/sw/auto/rocky8.6/epyc3/software/Python/3.9.5-GCCcore-10.3.0-bare/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/helper.py", line 113, in run result = self.task_function(next_task_args,self.locks) File "/home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py", line 419, in preprocess_gene genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' KeyError: ('AAEL024786', 2985) slurmstepd: error: JOB 3535379 ON bun031 CANCELLED AT 2023-04-16T04:34:40 DUE TO TIME LIMIT

Error message standard xpore in detail

Traceback (most recent call last): File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/helper.py", line 113, in run result = self.task_function(next_task_args,self.locks) File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 419, in preprocess_gene genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' KeyError: ('AAEL024786', 2984) =>> PBS: job killed: walltime 72066 exceeded limit 72000 ########################### Job Execution History #############################

I have also inspected all output files and all of them contain data outputs, please see below.

Output:modified xpore developer run on 17-04-2023

drwxr-sr-x. 2 user Q5334RW 4096 Apr 14 16:58 dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239

ls /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239

data.index data.json data.log data.readcount eventalign.index

head /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239/data.index

idx,start,end AAEL017263,0,2030676 AAEL012250,2030676,2799959 AAEL021326,2799959,2820651 AAEL017012,2820651,5097858 AAEL014368,5097858,5571925 AAEL002886,5571925,14939749 AAEL009400,14939749,15243925 AAEL001846,15243925,15257916 AAEL019615,15257916,15292102

head /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239/data.json

{"AAEL017263":{"78078919":{"GGAAA":[119.4,115.5,71.2,116.0,104.5,106.4,68.5,107.8,109.7,104.6,105.7,113.8,114.0,102.8,109.9,108.3,108.4,111.4,90.6,110.3,112.1,111.5,102.0,111.2,110.9,111.5,108.6,108.7,85.0,111.6,109.6,107.9,109.9,107.0,108.1,100.4,112.4,107.3,111.4,86.5,109.9,107.9,111.4,106.1,110.9,110.9,113.9,105.8,110.1,90.9,101.3,86.8,109.4,110.8,85.7,108.1,108.0,110.7,109.1,70.0,117.4,113.6,111.0,106.0,93.5,108.7,110.4,111.5,93.4,88.2,68.2,108.3,111.6,112.1,112.8,109.4,111.3,108.4,109.7,91.1,113.5,116.7,70.6,108.9,111.7,109.4,105.5,91.4,110.1,108.5,100.0,103.7,108.5,121.4,110.5,103.7,91.5,86.5,107.7,109.0,108.9,92.1,103.3,112.3,106.6,94.1,85.5,110.8,108.9,73.7,107.8,110.0,108.8,91.3,110.5,111.7,112.9,109.1,110.8,64.4,108.3,109.0,106.6,109.9,103.3,108.4,107.6,108.0,115.1,113.1,110.6,108.0,106.3,111.9,87.6,105.6,109.4,111.5,107.5,108.6,110.6,110.9,97.4]},"78078920":{"TGGAA":

etc.

TAT":[84.1]},"84837":{"GAATA":[111.4]},"84838":{"AGAAT":[128.7]},"84839":{"CAGAA":[121.1]},"84840":{"GCAGA":[95.8]},"84841":{"CGCAG":[92.5]},"84842":{"CCGCA":[84.0]},"84843":{"GCCGC":[76.5]},"84844":{"TGCCG":[99.9]},"84845":{"ATGCC":[83.0]},"84846":{"CATGC":[74.0]},"84847":{"ACATG":[80.4]},"84848":{"AACAT":[89.0]},"84849":{"CAACA":[91.6]},"84850":{"GCAAC":[83.1]},"84851":{"GGCAA":[110.9]},"84852":{"GGGCA":[105.4]},"84853":{"TGGGC":[113.5]},"84854":{"ATGGG":[97.8]},"84855":{"TATGG":[84.7]},"84856":{"ATATG":[85.9]},"84857":{"AATAT":[92.1]},"84858":{"CAATA":[112.4]},"84859":{"CCAAT":[90.0]},"84860":{"GCCAA":[74.3]},"84861":{"GGCCA":[103.8]},"84862":{"TGGCC":[110.0]},"84863":{"CTGGC":[104.3]},"84864":{"GCTGG":[89.3]}}}

head /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239/data.log

AAEL017263: True AAEL012250: True AAEL021326: True AAEL017012: False AAEL014368: True AAEL002886: True AAEL009400: False AAEL001846: True AAEL019615: True AAEL024674: True

head /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239/data.readcount

idx,n_reads AAEL017263,245 AAEL012250,115 AAEL021326,7 AAEL017012,326 AAEL014368,124 AAEL002886,1001 AAEL009400,267 AAEL001846,1 AAEL019615,3

head /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239/eventalign.index

transcript_id,read_index,pos_start,pos_end AAEL006471,6,172,227770 AAEL006471,10,227770,470953 AAEL006471,15,470953,734427 AAEL006471,14,734427,1002147 AAEL012458,0,1002147,1108621 AAEL006471,9,1108621,1386556 AAEL006471,5,1386556,1664122 AAEL006471,7,1664122,1940105 AAEL006471,8,1940105,2211471

Any help will be greatly appreciated again.

Many thanks, Michael

Michael-m6A commented 1 year ago

Hi Yuk (@yuukiiwa),

Would it be possible to get your help on this xpore dataprep genomic coordinate error?

I have used the modified version (michael branch) you kindly provided to me in January but still get an error message one position further down the line.

Error message modified xpore developer File "/home/s4303883/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py", line 419, in preprocess_gene genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' KeyError: ('AAEL024786', 2985)

Please see further details and the previous error message in the post from the 19th of April above.

Many thanks in advance, Michael

Michael-m6A commented 11 months ago

Hi Ploy (@ploy-np), Yuk (@yuukiiwa), and Jonathan (@jonathangoeke),

I am writing to all of you in the hope to get your help with the xpore dataprep genomic coordinate error.

Please see the error message in full in my previous post from 19th of April above where I used the modified version (michael branch) you kindly provided.

Error message modified xpore developer File "/home/s4303883/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py", line 419, in preprocess_gene genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' KeyError: ('AAEL024786', 2985)

Thank you in advance, Michael

lyj95618 commented 9 months ago

Hi Ploy (@ploy-np), Yuk (@yuukiiwa), and Jonathan (@jonathangoeke),

I am writing to all of you in the hope to get your help with the xpore dataprep genomic coordinate error.

Please see the error message in full in my previous post from 19th of April above where I used the modified version (michael branch) you kindly provided.

Error message modified xpore developer File "/home/s4303883/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py", line 419, in preprocess_gene genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' KeyError: ('AAEL024786', 2985)

Thank you in advance, Michael

Hi Michael, I am wondering if you have resolved this issue or have any update on it. I am currently running Xpore and experiencing a very similar issue as you.

Thanks, Laur

Michael-m6A commented 8 months ago

Hi Laur (@lyj95618) and Yuk (@yuukiiwa),

Unfortunately, I have not been able to resolve the xPore dataprep converting transcriptome positions to genomic coordinates issue.

However, I did investigate if they are variations between the reference annotation files and the number of exons from two different databases.

I have compared the Aedes aegypti annotation files from ENSEMBL and VectorBase for the gene “AAEL024786” which produced the error in xPore dataprep for me.

The location coordinates and number of exons for “AAEL024786” are identical in both ENSEMBL and VectorBase database annotation files.

My conclusion so far is that the source of the problem is within the xPore transcriptome to genomic coordinates (t2g_mapping) code, see error message which is the same as your error below.

File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 419, in preprocess_gene genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' KeyError: ('AAEL024786', 2985)

Please let me know if you find a way to overcome this xPore dataprep genomic coordinates problem.

Many thanks, Michael

lyj95618 commented 8 months ago

Hi Michael @Michael-m6A,

I have also done some investigations on the errors and found that it was the issue in the transcriptome to genomic coordinates conversion code.

I tried to run Xpore without the --genome flag which meant I didn't want the transcriptome to genomic coordinates conversion and the whole thing finished smoothly without any error message. The only problem with this is the output location is in transcriptomic, not genomic.

I might go to try to download the most recent version of Ensembl cdna and GTF and see if that helps to overcome the genomic coordinate issue.

Thanks! Laur

Michael-m6A commented 8 months ago

Hi Laur (@lyj95618),

Thank you very much for sharing!

I also run xPore without the -–genome flag which worked well and gave me the locations within the transcriptome.

However, being able to get the genomic coordinates of the modifications would be really valuable. Hopefully, the @GoekeLab can fix the bug/error in their xPore transcriptome to genomic coordinates conversion code.

Thanks, Michael