GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License

xpore dataprep indexing error: PerformanceWarning: indexing past lexsort depth may impact performance. #156

Closed: kwonej0617 closed this issue 1 year ago

kwonej0617 commented 1 year ago

Hi developers, thank you for making such a useful tool. I have a question about the messages in the error log file from xpore dataprep.

I ran guppy, minimap, and nanopolish successfully on the HEK293 WT/KO nanopore data from your study, then used the nanopolish output as input to xpore dataprep. The job completed successfully according to the output log file, and the dataprep output looks complete. However, the error log file contains the messages below.

/home/ek81w/.conda/envs/xpore_2.1/lib/python3.10/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ek81w/.conda/envs/xpore_2.1/lib/python3.10/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ek81w/.conda/envs/xpore_2.1/lib/python3.10/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()

The end of the log file reads:

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk_split['line_length'] = np.array(lines)
/home/ek81w/.conda/envs/xpore_2.1/lib/python3.10/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ek81w/.conda/envs/xpore_2.1/lib/python3.10/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ek81w/.conda/envs/xpore_2.1/lib/python3.10/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()

For xpore dataprep, I ran this command:

xpore dataprep --eventalign HEK293T-WT-rep2/nanopolish/eventalign.txt --gtf_or_gff reference/Homo_sapiens.GRCh38.107.gtf --transcript_fasta reference/Homo_sapiens.GRCh38.cdna.all_ncRNA.fa --out_dir HEK293T-WT-rep2/dataprep/ --genome

Could you check whether any of my input files are in the wrong format?

Nanopolish output

contig  position        reference_kmer  read_index      strand  event_index     event_level_mean        event_stdv      event_length    model_kmer      model_mean      model_stdv      standardized_level      start_idx       end_idx
ENST00000361390.2       0       ATACC   11      t       8       93.20   6.032   0.00664 NNNNN   0.00    0.00    inf     76731   76751
ENST00000361390.2       0       ATACC   11      t       9       112.76  4.234   0.00266 NNNNN   0.00    0.00    inf     76723   76731
ENST00000361390.2       0       ATACC   11      t       10      95.80   7.344   0.00432 NNNNN   0.00    0.00    inf     76710   76723
ENST00000361390.2       3       CCCAT   11      t       11      76.61   0.936   0.00365 CCCAT   73.36   2.11    1.33    76699   76710
ENST00000361390.2       4       CCATG   11      t       12      83.71   1.977   0.00598 CCATG   82.78   2.13    0.38    76681   76699

Homo_sapiens.GRCh38.107.gtf GTF file

#!genome-build GRCh38.p13
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession GCA_000001405.28
#!genebuild-last-updated 2022-04
1       ensembl_havana  gene    1471765 1497848 .       +       .       gene_id "ENSG00000160072"; gene_version "20"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
1       ensembl_havana  transcript      1471765 1497848 .       +       .       gene_id "ENSG00000160072"; gene_version "20"; transcript_id "ENST00000673477"; transcript_version "1"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ATAD3B-206"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS30"; tag "basic";
1       ensembl_havana  exon    1471765 1472089 .       +       .       gene_id "ENSG00000160072"; gene_version "20"; transcript_id "ENST00000673477"; transcript_version "1"; exon_number "1"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ATAD3B-206"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS30"; exon_id "ENSE00003889014"; exon_version "1"; tag "basic";

Homo_sapiens.GRCh38.cdna.all_ncRNA.fa (combined cDNA and ncRNA FASTA files downloaded from GENCODE)

>ENST00000632248.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142421573:142422090:1 gene:ENSG00000282618.1 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRBV10-1 description:T cell receptor beta variable 10-1 [Source:HGNC Symbol;Acc:HGNC:12177]
ACTGAGAGCCCAACTTCAGTCTGCCCACAGCAGGGCTGGGAGACACAAGATCCTGCCCTG
GAGCTGAAATGGGCACGAGGCTCTTCTTCTATGTGGCCCTTTGTCTGCTGTGGGCAGGAC
ACAGGGATGCTGAAATCACCCAGAGCCCAAGACACAAGATCACAGAGACAGGAAGGCAGG
TGACCTTGGCGTGTCACCAGACTTGGAACCACAACAATATGTTCTGGTATCGACAAGACC
TGGGACATGGGCTGAGGCTGATCCATTACTCATATGGTGTTCACGACACTAACAAAGGAG
AAGTCTCAGATGGCTACAGTGTCTCTAGATCAAACACAGAGGACCTCCCCCTCACTCTGG
AGTCTGCTGCCTCCTCCCAGACATCTGTATATTTCTGCGCCAGCAGTGAGTC
>ENST00000633313.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142476873:142477334:1 gene:ENSG00000282756.1 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRBV7-4 description:T cell receptor beta variable 7-4 [Source:HGNC Symbol;Acc:HGNC:12238]
ATGGGCACCAGGCTCCTCTGCTGGGTGGTCCTGGGTTTCCTAGGGACAGATCACACAGGT
GCTGGAGTCTCCCAGTCCCCAAGGTACAAAGTCGCAAAGAGGGGACGGGATGTAGCTCTC
AGGTGTGATTCAATTTCGGGTCATGTAACCCTTTATTGGTACCGACAGACCCTGGGGCAG
GGCTCAGAGGTTCTGACTTACTCCCAGAGTGATGCTCAACGAGACAAATCAGGGCGGCCC
AGTGGTCGGTTCTCTGCAGAGAGGCCTGAGAGATCCGTCTCCACTCTGAAGATCCAGTGC
ACAGAGCAGGGGGACTCAGCTGTGTATCTCTGTGCCAGCAGCTTAGC

It would be really helpful if you could give me any advice or suggestions. Thank you!

yuukiiwa commented 1 year ago

Hi @kwonej0617,

The lines in your error.log file are just warnings that we didn't suppress, so your run should be fine.

/home/ek81w/.conda/envs/xpore_2.1/lib/python3.10/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ek81w/.conda/envs/xpore_2.1/lib/python3.10/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ek81w/.conda/envs/xpore_2.1/lib/python3.10/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
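
For anyone curious where this warning comes from: pandas emits it when a tuple lookup on a MultiIndex goes deeper than the number of levels the index is sorted on. It affects speed, not results. Below is a minimal sketch on a hypothetical toy table. The column names only loosely mirror the traceback, and none of this is taken from xpore's actual dataprep code. It shows two general ways to keep the log quiet: filtering the warning class, or sorting the index before the lookup.

```python
import warnings

import pandas as pd

# Hypothetical toy table (NOT xpore's real data layout): the index is
# neither sorted nor unique, which is the situation that can make pandas
# report "indexing past lexsort depth".
df = pd.DataFrame(
    {
        "contig": ["tx2", "tx1", "tx2", "tx1"],
        "read_index": [11, 7, 11, 7],
        "line_length": [10, 20, 30, 40],
    }
).set_index(["contig", "read_index"])

key = ("tx2", 11)

# Option 1: silence only PerformanceWarning around the lookup.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=pd.errors.PerformanceWarning)
    total_unsorted = int(df.loc[key]["line_length"].sum())

# Option 2: sort the index first so the lookup stays within the sorted depth.
df_sorted = df.sort_index()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    total_sorted = int(df_sorted.loc[key]["line_length"].sum())

# Collect any PerformanceWarning raised by the sorted lookup.
perf_warnings = [
    w for w in caught if issubclass(w.category, pd.errors.PerformanceWarning)
]

print(total_unsorted, total_sorted, len(perf_warnings))
```

Both lookups return the same sum, which is why the warning can be ignored for correctness. Sorting is usually the nicer fix when the same index is queried many times.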

Thanks!

Best wishes, Yuk Kei

kwonej0617 commented 1 year ago

Thank you so much!

yuukiiwa commented 1 year ago

No problem!