GoekeLab / m6anet

Detection of m6A from direct RNA-Seq data
https://m6anet.readthedocs.io/
MIT License
103 stars 19 forks source link

m6anet-dataprep gives NaN as output #7

Closed FraPria closed 3 years ago

FraPria commented 3 years ago

Hello,

I ran nanopolish and m6anet-dataprep but the data.json contains NaN:

{"68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000":{"315":{"CAGACAG":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000":{"322":{"GGGACTT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000":{"349":{"GAGACAT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN],[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000":{"381":{"GTGACAG":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN],[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000":{"488":{"TGGACAA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}

EventAlign output doesn’t seem to be wrong:

contig  position        reference_kmer  read_index      strand  event_index     event_level_mean        event_stdv      event_length    model_kmer      model_mean      model_stdv      standardized_level
68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000 273     AGCTG   0       t       8       109.36  7.133   0.00498 AGCTG   117.44  3.55    -1.99
68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000 274     GCTGT   0       t       9       87.27   1.491   0.00432 GCTGT   84.35   2.85    0.89
68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000 275     CTGTT   0       t       10      102.96  2.044   0.00398 CTGTT   102.18  3.78    0.18
68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000 276     TGTTA   0       t       11      111.84  6.158   0.00664 TGTTA   106.43  7.49    0.63

I tried to run m6anet-run_inference and the output is missing the probability_modified column:

transcript_id,transcript_position,n_reads,probability_modified
316b811d-5a1e-46ad-bcc6-969d81d03759_chr1:566000,19,40,
316b811d-5a1e-46ad-bcc6-969d81d03759_chr1:566000,78,23,
316b811d-5a1e-46ad-bcc6-969d81d03759_chr1:566000,86,847,
316b811d-5a1e-46ad-bcc6-969d81d03759_chr1:566000,147,821,
316b811d-5a1e-46ad-bcc6-969d81d03759_chr1:566000,156,867,

Here the commands I used:

$nanopolish eventalign \
    --reads $FASTQ \
    --bam $BAM \
    --genome $REF \
    --scale-events -t $NCPUS > $WORKDIR/$id.eventalign.txt
m6anet-dataprep --eventalign $WORKDIR/$id.eventalign.txt \
                --out_dir $WORKDIR --n_processes $NCPUS

m6anet-inference --input_dir $WORKDIR --out_dir $WORKDIR --n_processes $NCPUS

I got no errors during the installation process

Could you help me understand this error please? Thank you in advance for your help

chrishendra93 commented 3 years ago

Hi @FraPria, thanks for raising this issue. Can I check whether data.json contains all NaN, or if those are the only entries with NaN? Can you show me the output of data.index as well? If data.index output is fine, then do you think you can give me a minimal reproducible example that results in this error? Perhaps you can extract eventalign outputs only from those positions with NaN entries

I don't expect m6anet-run_inference to run successfully given those outputs but thanks for raising this, in the future I will include some checks to make sure that it will not run if there is a problem with data.json

FraPria commented 3 years ago

Hi @chrishendra93, thanks for your answer!

the output of data.index is like this:

transcript_id,transcript_position,start,end
68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000,2170,0,113
68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000,3061,113,226
68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000,3153,226,377
68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000,3159,377,528
68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000,3565,528,641
c6e29c47-9e37-427a-81d5-70e87a061505_ENSG00000228463,101,641,986
c6e29c47-9e37-427a-81d5-70e87a061505_ENSG00000228463,156,986,1293
c6e29c47-9e37-427a-81d5-70e87a061505_ENSG00000228463,194,1293,1600
c6e29c47-9e37-427a-81d5-70e87a061505_ENSG00000228463,228,1600,2021

Here I attach a subset of the eventalign output eventalign.txt

Running the command: m6anet-dataprep --eventalign eventalign.txt --out_dir .

I get this output:

{"68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000":{"2170":{"GAGACAT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000":{"3153":{"GTGACCC":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"68ebd370-cef0-4b97-bd1b-825d623446e2_chr1:18000":{"3159":{"CTGACTT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"c6e29c47-9e37-427a-81d5-70e87a061505_ENSG00000228463":{"101":{"AAGACAC":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"c6e29c47-9e37-427a-81d5-70e87a061505_ENSG00000228463":{"194":{"CTGACCA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"c6e29c47-9e37-427a-81d5-70e87a061505_ENSG00000228463":{"228":{"AAAACAA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"c6e29c47-9e37-427a-81d5-70e87a061505_ENSG00000228463":{"303":{"ATGACAG":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"c6e29c47-9e37-427a-81d5-70e87a061505_ENSG00000228463":{"398":{"GTAACAA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"c6e29c47-9e37-427a-81d5-70e87a061505_ENSG00000228463":{"418":{"TAAACTG":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"c6e29c47-9e37-427a-81d5-70e87a061505_ENSG00000228463":{"436":{"CTAACTC":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}

It outputs also a warning:

site-packages/m6anet-0.1.1-py3.7.egg/m6anet/scripts/dataprep.py:143: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Thank you

chrishendra93 commented 3 years ago

Thank you for the response! After checking your eventalign.txt closely, it seems that your nanopolish run does not produce the column start_idx and end_idx that is required by m6anet to preprocess the file. You need to include --signal-index option in your nanopolish eventalign command, let me know if it is working for you!

FraPria commented 3 years ago

That was the problem, now it gives the expected output, thanks a lot!

chrishendra93 commented 3 years ago

Hi @FraPria, glad that you can run the model successfully. I will update the documentation with this instruction just in case so that people know what commands to supply to Nanopolish in order to run the model successfully. I will now close this issue, let us know if you have any more problems running the model!