PerformanceWarning & AssertionError in m6anet-dataprep

YCCHEN23 commented 2 years ago

Hi,

I performed m6anet-dataprep few days ago with following commands:

(base) ycc@ycc-UBUNTU:~$ m6anet-dataprep --eventalign ~/Desktop/WT_R2/WT_Dark_R2_eventalign.txt \
>                 --out_dir ~/Desktop/WT_R2/m6anet/ \
>                 --n_processes 8

Then I got the following error messages, the script was pausing with AssertionError

/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py:100: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py:100: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py:100: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py:100: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py:100: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py:100: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py:100: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py:100: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py:142: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk_split['line_length'] = np.array(lines)
Process Consumer-10:
Traceback (most recent call last):
  File "/home/ycc/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/helper.py", line 113, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py", line 304, in preprocess_tx
    assert(len(kmer) == 1)
AssertionError
Process Consumer-16:
Traceback (most recent call last):
  File "/home/ycc/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/helper.py", line 113, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py", line 304, in preprocess_tx
    assert(len(kmer) == 1)
AssertionError
Process Consumer-12:
Traceback (most recent call last):
  File "/home/ycc/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/helper.py", line 113, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py", line 304, in preprocess_tx
    assert(len(kmer) == 1)
AssertionError
Process Consumer-9:
Traceback (most recent call last):
  File "/home/ycc/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/helper.py", line 113, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py", line 304, in preprocess_tx
    assert(len(kmer) == 1)
AssertionError
Process Consumer-13:
Traceback (most recent call last):
  File "/home/ycc/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/helper.py", line 113, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py", line 304, in preprocess_tx
    assert(len(kmer) == 1)
AssertionError
Process Consumer-15:
Traceback (most recent call last):
  File "/home/ycc/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/helper.py", line 113, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py", line 304, in preprocess_tx
    assert(len(kmer) == 1)
AssertionError
Process Consumer-14:
Traceback (most recent call last):
  File "/home/ycc/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/helper.py", line 113, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py", line 304, in preprocess_tx
    assert(len(kmer) == 1)
AssertionError
Process Consumer-11:
Traceback (most recent call last):
  File "/home/ycc/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/helper.py", line 113, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/home/ycc/anaconda3/lib/python3.8/site-packages/m6anet/scripts/dataprep.py", line 304, in preprocess_tx
    assert(len(kmer) == 1)
AssertionError

The script just stuck at here for ~24 hours, so I interrupt the program. And I only got 25 genes in my data.log

Screenshot from 2021-10-12 10-42-26

Willing to provide more details if needed, thanks!

Best regards YCCHEN

chrishendra93 commented 2 years ago

Thanks for posting the issue. It seems that the program does not automatically terminate after triggering the assertion error. I will update this on the next version. Nevertheless, the assertion triggered because there is a position in the eventalign.txt that contains more than one possible 5-mer. Do you know if it is possible with the reference you are using? Also can you show me the first few row of eventalign.txt?

Thank you!

Regards

Christopher Hendra

YCCHEN23 commented 2 years ago

Thanks for prompt reply

Here are the reference & software & commands I performed before using m6anet-dataprep:

First, I extracted the transcript fasta from genome.fa with gffread.

Reference Genome: http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-51/plants/fasta/arabidopsis_thaliana/dna/ (version: 05-Mar-2021 11:02)
Reference GTF file: http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-51/plants/gtf/arabidopsis_thaliana/ (version: 08-Mar-2021 22:31)

### (I added "chr" to each chromosome name in both genome.fa & gtf. ( ex: 1 -> chr1 ))

awk '{ if($1 !~ /^#|[A-Z]/){print "chr"$0} else{print $0} }' Arabidopsis_thaliana.TAIR10.51.gtf > renamed_chromosome.gtf

# gffread version 0.12.1
gffread -w /media/ycc/2AFC41ECFC41B2BD/Genome/Arabidopsis/cDNA/transcripts.fa \
        -g /media/ycc/2AFC41ECFC41B2BD/Genome/Arabidopsis/gDNA/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa \
        /media/ycc/2AFC41ECFC41B2BD/Genome/Arabidopsis/gff3/renamed_chromosome.gtf

Then aligned the merged fastq reads (cat all the passed reads from Guppy basecalling) to the transcript fasta and performed nanopolish.

#minimap2 version 2.17-r941
#samtools version 1.9
#nanopolish version 0.13.3

minimap2 -ax splice -uf -k14 -t 8 /media/ycc/2AFC41ECFC41B2BD/Genome/Arabidopsis/cDNA/transcripts.fa \
         ~/Desktop/WT_R2/WT2_merged.fastq | samtools view -bh -F 2324 | samtools sort -O bam > ~/Desktop/WT_R2/WT_R2_filtered.bam

nanopolish index -s /media/ycc/Transcend/Col_Dark_R2/sequencing_summary.txt \
                 -d /media/ycc/Transcend/Col_Dark_R2/workspace/ \
                 ~/Desktop/WT_R2/WT2_merged.fastq

nanopolish eventalign --reads ~/Desktop/WT_R2/WT2_merged.fastq \
                      --bam ~/Desktop/WT_R2/WT_R2_filtered.bam \
                      --genome /media/ycc/2AFC41ECFC41B2BD/Genome/Arabidopsis/cDNA/transcripts.fa \
                      --scale-events \
                      -t 4 \
                      --progress \
                      --signal-index > ~/Desktop/WT_R2/WT_Dark_R2_eventalign.txt

Here's a part of my eventalign results.

contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx
AT1G01010.1 9   GATAT   1   t   3   70.52   5.003   0.00232 GATAT   88.95   6.28    -2.61   65275   65282
AT1G01010.1 10  ATATA   1   t   4   76.90   1.505   0.00299 ATATA   86.67   2.63    -3.31   65266   65275
AT1G01010.1 11  TATAC   1   t   5   97.65   2.540   0.00232 TATAC   95.09   6.28    0.36    65259   65266
AT1G01010.1 11  TATAC   1   t   6   92.51   2.549   0.00266 TATAC   95.09   6.28    -0.37   65251   65259
AT1G01010.1 12  ATACC   1   t   7   73.41   1.788   0.00930 ATACC   74.69   2.10    -0.54   65223   65251
AT1G01010.1 12  ATACC   1   t   8   71.55   1.331   0.00531 ATACC   74.69   2.10    -1.33   65207   65223
AT1G01010.1 13  TACCA   1   t   9   77.65   2.034   0.00266 TACCA   77.84   2.81    -0.06   65199   65207
AT1G01010.1 14  ACCAA   1   t   10  75.49   0.738   0.00564 ACCAA   74.58   2.11    0.38    65182   65199
AT1G01010.1 14  ACCAA   1   t   11  72.94   1.398   0.00697 ACCAA   74.58   2.11    -0.69   65161   65182
AT1G01010.1 15  CCAAA   1   t   12  84.03   2.852   0.00963 CCAAA   87.19   3.02    -0.93   65132   65161
AT1G01010.1 15  CCAAA   1   t   13  90.83   3.311   0.00299 CCAAA   87.19   3.02    1.08    65123   65132
AT1G01010.1 16  CAAAC   1   t   14  104.27  2.112   0.00199 CAAAC   104.22  2.68    0.02    65117   65123
AT1G01010.1 16  CAAAC   1   t   15  107.91  0.911   0.00299 CAAAC   104.22  2.68    1.23    65108   65117
AT1G01010.1 16  CAAAC   1   t   16  103.47  3.530   0.00365 CAAAC   104.22  2.68    -0.25   65097   65108
AT1G01010.1 17  AAACC   1   t   17  94.76   5.007   0.00266 AAACC   100.00  3.45    -1.35   65089   65097
AT1G01010.1 18  AACCA   1   t   18  82.91   2.254   0.00730 AACCA   82.92   2.81    -0.00   65067   65089
AT1G01010.1 19  ACCAG   1   t   19  75.00   1.675   0.00365 ACCAG   72.33   2.11    1.13    65056   65067
AT1G01010.1 19  ACCAG   1   t   20  73.95   1.884   0.00398 ACCAG   72.33   2.11    0.69    65044   65056
AT1G01010.1 19  ACCAG   1   t   21  73.16   2.420   0.00398 ACCAG   72.33   2.11    0.35    65032   65044
AT1G01010.1 20  CCAGA   1   t   22  94.73   10.344  0.00232 CCAGA   76.22   5.09    3.24    65025   65032
AT1G01010.1 20  CCAGA   1   t   23  71.16   1.432   0.00232 CCAGA   76.22   5.09    -0.88   65018   65025
AT1G01010.1 21  CAGAG   1   t   24  115.49  5.990   0.00232 CAGAG   106.73  5.87    1.33    65011   65018
AT1G01010.1 22  AGAGA   1   t   25  131.68  5.844   0.01926 AGAGA   127.71  5.69    0.62    64953   65011
AT1G01010.1 23  GAGAA   1   t   26  116.67  7.068   0.01394 GAGAA   125.76  5.87    -1.38   64911   64953
AT1G01010.1 24  AGAAA   1   t   27  129.21  4.249   0.00498 AGAAA   128.13  5.56    0.17    64896   64911

chrishendra93 commented 2 years ago

hi @YCCHEN23, I have pushed a quick fix for this that will skip the problematic positions and record the positions along with the non-unique 5-mers in a file called data.error. I have yet to push this to the main branch since we have not tested it. Can I please trouble you with testing this with your data?

To do so, you can clone the repository and run python setup.py install on the branch release_1.01. From there you can run m6anet-dataprep again and it should finish preprocessing all the data. Then if you have managed to do so, please share the content of data.error here so we can understand the problem better. Let me know if you encounter any difficulties!

Regards

Christopher Hendra

YCCHEN23 commented 2 years ago

Hi, thanks for the efficient update, I run m6Anet-dataprep again with the release_1.01 version, and it took about 7 hours to finish the process.

Everything goes well and I successfully obtain the data.result.csv from m6anet-run_inference.

After I checked the data.error & data.log, it seemed that the problem of "non-unique 5-mers" was caused by "I mapped the reads to cDNA (transcript) reference". However, m6anet is preparing the data at gene level, so "non-unique k-mers caused by different isoform" is not expected, is that correct?

If so, can I use m6anet to analysis the alignment results from cDNA reference?

I can send you apart of my eventalign result, cDNA reference, and data.error if needed.

Best regards

YCCHEN

chrishendra93 commented 2 years ago

hi @YCCHEN23, great to know that you have managed to run m6anet-dataprep on your samples. I hope that the inference results are useful for you

In any case, m6Anet prepares data on the transcript level, typically we require users to aggregate predictions made on the transcript level by themselves in order to get gene-level prediction. However during run-time, m6Anet-dataprep will run a check if a 5-mer is trully unique on the transcript level, so if you are aligning this to transcript, it confuses me as to why there can be non-unique 5-mers in that position.

You can perhaps show me the data.error results here? This way I can also check on my part if there are some bugs that we have to handle

Thanks!

Regards

Christopher Hendra

YCCHEN23 commented 2 years ago

Thank you for your clear explanation, but I'm just confused that I can't distinguish m6A sites information in transcript level from my result.

Take AT1G01040 as an example, I don't know which is the right transcript (AT1G01040.1 or AT1G01040.2?), because only gene id was recorded in the file. There is no isoform information showed in transcript_id (gene.1, gene.2, gene.3, etc)

Is the result recorded properly?

The following data are the first few lines of my result:

transcript_id,transcript_position,n_reads,probability_modified
AT1G01010,978,20,0.15509425
AT1G01010,1062,20,0.2124232
AT1G01010,1084,20,0.094096266
AT1G01010,1124,20,0.33268613
AT1G01010,1162,20,0.26048714
AT1G01010,1273,21,0.045965612
AT1G01010,1345,21,0.56280404
AT1G01010,1357,20,0.4143505
AT1G01040,4200,22,0.60451066
AT1G01040,4251,23,0.19029099
AT1G01040,4273,23,0.71105254
AT1G01040,4292,26,0.6048376
AT1G01040,4307,24,0.3069207
AT1G01040,4333,26,0.38802323
AT1G01040,4487,28,0.50886166
AT1G01040,4755,28,0.3398115
AT1G01040,4798,29,0.027187109
AT1G01040,4815,30,0.07250579
AT1G01040,4855,31,0.29441229

And sure, here's my result of data.error, hoping this comes in helpful for you.

There are 8835 sites recorded in the file, so I upload the zipped file.

data.error.zip

Best regards

YCCHEN

chrishendra93 commented 2 years ago

hi @YCCHEN23

I understand what's going on.

Apology, I think this is because there's a code left over from my previous experiments that will automatically remove the .1, .2, .3 from the transcripts and so they are grouped together in such a way.

I have push the fix to m6anet_release_1.01 that should fix this issue. Do you mind running this again to check? By the way if you have run this before and you still keep eventalign.index, you can just pass the argument --skip_index so that the program will skip indexing the entire eventalign file again (do check the first few entries of eventalign.index first, the transcript in there should be in the form of AT1G01040.1 or AT1G01040.2 etc)

YCCHEN23 commented 2 years ago

Never mind, I have performed the latest version and it still running. But now, the m6A information seemed to be recorded properly at transcript-level in all the outputs and there's no data.error occurring.

I'll close the comment within few days if there has no further errors.

Thanks for all the kindly helping.

Best regards

YCCHEN

chrishendra93 commented 2 years ago

Thanks for trying out! It's great to know that you have managed to run it without any problem. Do let me know if you encounter any other problems

Regards

Christopher Hendra

Stakaitis commented 1 year ago

head_eventalign.txt head_eventalign.index.txt head_data.readcount.txt head_data.log.txt head_data.json.txt head_data.index.txt Hi @chrishendra93,

I'm getting the same AssertionError while working with reads aligned to reference genome (I know that I should use transcriptome instead, but I'm curious how different the results will be if ref genome is used):

  Process Consumer-19:
  Traceback (most recent call last):
    File "/home/stakatis/miniconda3/envs/m6anet/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
      self. Run()
    File "/home/stakatis/miniconda3/envs/m6anet/lib/python3.9/site-packages/m6anet/scripts/helper.py", line 113, in run
      result = self.task_function(*next_task_args,self.locks)
   File "/home/stakatis/miniconda3/envs/m6anet/lib/python3.9/site-packages/m6anet/scripts/dataprep.py", line 310, in preprocess_tx
     assert(len(kmer) == 1)
 AssertionError

Oddly, only 1 out of 4 samples rises this error, others finish the inference step successfully.

I'm attaching the first lines of some input/output files of the erroneous sample

Versions:

m6Anet - I believe I'm using the latest version of m6Anet, since I'm able to parse the --infer_mod_rate argument, though, I'm not sure how to check the precise version I'm using.
nanopolish version 0.13.2
samtools 1.15.1
htslib 1.15.1
minimap2 2.17-r941
Python 3.9.0
pandas 1.4.2

Commands: Ref_file=GRCh38.primary_assembly.genome.fa

minimap2 -ax splice -k14 ${Ref_index} ${Reads_fq} | samtools sort -K 14 -o ${BAM}

samtools view --bam --with-header --output ${bam_filtered} --output-unselected ${bam_unselected} --min-MQ 1 --excl-flags UNMAP,SECONDARY,SUPPLEMENTARY ${BAM}

samtools index ${bam_filtered}

nanopolish index --directory ${Fast5_dir} -s ${Summary_file} ${Reads_fq}

nanopolish eventalign --reads ${Reads_fq} --bam ${bam_filtered} --genome ${Ref_file} --scale-events --signal-index > ${EVENTALIGN}

m6anet-dataprep --eventalign ${EVENTALIGN} --out_dir ${DATAPREP_DIR} --n_processes ${CPUS} --readcount_min 2 --min_segment_count 2

m6anet-run_inference --input_dir ${DATAPREP_DIR} --out_dir ${INFERENCE_DIR} --infer_mod_rate --n_processes ${CPUS}

What kind of help do I seek? What changes and to which script files should I make to avoid this error and instead output an additional data.error file?

chrishendra93 commented 1 year ago

hi @Stakaitis, seems like due to the genome mapping, there are two possible 5-mer in that specific site and so the code throws an error (which I set deliberately to prevent having multiple possible 5-mers in here). I think this might only be observed in one out of four of your samples, you can add a print statement to the code to see the specific site that throws this error. Usually what I do with genomic coordinates is to map them after m6anet inference and average m6A probability per genomic coordinate.

Stakaitis commented 1 year ago

Thank you for a quick reply.

Where exactly should I add a print statement in the dataprep.py file?
How this print statement should be written?
Is there an easy way how to exclude these overlapping 5-mers, and avoid this error allowing inference step to finish successfully while working with reference genome?

For now, I will use your advice for getting m6A location at the genomic coordinates

Thanks!

chrishendra93 commented 1 year ago

hi @Stakaitis sorry for the delay in my reply, I was busy the entire last week.

I suggested printing the kmer just to see if the splicing site is indeed the case. I think one way to handle this will be to modify the dataset class in m6anet.utils.data_utils.py to output the k-mer separately but currently, we need to modify the dataprep.py too in order to accommodate this behavior. On the other hand, mapping this to the transcriptome produces probability on the read-level that gives more overall control over the output

Stakaitis commented 1 year ago

I agree, it makes more sense to analyse reads mapped to the transcriptome. Thanks!

GoekeLab / m6anet

PerformanceWarning & AssertionError in m6anet-dataprep #23