al-mcintyre / mCaller

A python program to call methylation (m6A in DNA) from nanopore signal data
MIT License
43 stars 16 forks source link

model is not recognizing reference #25

Open wshropshire opened 3 years ago

wshropshire commented 3 years ago

Hello,

I am trying to run the m6a model with a reference assembly I generated using flye. However, I'm getting this error:

python3 /data/opt/programs/etc/mCaller/mCaller-1.0.3/mCaller.py -t 24 -m CRAANNNNNNNTGC -r ./6180WT/6180WT_flye_assembly.fasta -d /data/opt/programs/etc/mCaller/mCaller-1.0.3/r95_twobase_model_NN_6_m6A.pkl -e 6180WT.eventalign.tsv -b A -f barcode04_6180WT.fastq

1 contigs 24 threads Process Process-2: Error: could not find sequence for reference contig g_1 Error: could not find sequence for reference contig contig Process Process-11: Error: could not find sequence for reference contig 14632 Error: could not find sequence for reference contig ntig_1 Process Process-8: Traceback (most recent call last): File "/data/opt/programs/etc/python/v3.7.0/Python-3.7.0/make_location/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap self.run()

Any insights into how to resolve this would be appreciated.

al-mcintyre commented 3 years ago

Hello,

How many contigs do you have in your reference? Does mCaller fail for all of them? It could be one of several things: 1) the names of your reference contigs don't match the names of the contigs in your nanopolish tsv, 2) there is an issue with your reference fasta file format, 3) there is still an issue with handling motifs in this format and mCaller will work if you provide a list of positions instead. Assuming 1 and 2 are in order, perhaps run a quick check by providing a few positions of interest. I may need to update how mCaller handles motifs; I'm not sure at the moment if 'N's and 'R's are permitted.

wshropshire commented 3 years ago

Thanks again for the quick response!

I only have one contig in my reference. I did though attempt to just run mCaller with just the motif "A" and it appears to be writing to a temp file currently. However, it still returns a warning:

` python3 /data/opt/programs/etc/mCaller/mCaller-1.0.3/mCaller.py -m A -b A -r 6180WT_flye_assembly.fasta -d /data/opt/programs/etc/mCaller/mCaller-1.0.3/r94_model_NN_6_m6A.pkl -e 6180WT.eventalign2.tsv -f barcode04_6180WT.fastq

1 contigs 1 threads Error: could not find sequence for reference contig contig`

That being said, looks like it's still writing to a file. Therefore, I'm thinking it may have something to do with the non-standard IUPAC codes. Out of curiosity, how did you subset the type I R-M motifs with the bipartite pattern (e.g. CAAYNNNNNNNTTYG) that you have in the figure presented in your paper and on GitHub? Was this based on a position file or just subsetting from a larger file with all known m6A methylations?

Thanks!

Will

al-mcintyre commented 3 years ago

For the paper, I used the positions generated by PacBio. I'm now remembering that IUPAC motif handling was one of the things I meant to get around to and never did. It shouldn't be too difficult to implement using re though.

It looks like you may have some reads assigned to a contig named "contig" that doesn't appear in your reference file -- maybe double-check this as well.

wshropshire commented 3 years ago

Yeah, I believe you could potentially just use regex for degenerate bases in the motif option.

To circle back though to the positions generated by PacBio which you reported on in your paper, I'm curious how you were able to use those motifs that have degenerate positions (e.g. N, R, Y) to report the percent that were methylated? Did you just subset from a larger file all these positions or instead used a positions document?

I only have one ~1.9M contig with the header '>contig_1', but perhaps there is an issue with how nanopolish eventalign parses headers. I'll try to check ASAP, but my old institution for some reason disabled my guest access to the server I'm using so have remedy that first :(.

Thanks for all your help!

Will

wshropshire commented 3 years ago

Sorry I read your response more carefully and see you used a positions document from PacBio. I'll see if I can find that document from the PacBio output that was analyzed by another institution.

Any suggestions though on how to analyze isolates with type I R-M motifs that do not have corresponding PacBio data? I'll see if I can look at your code and come up with a regex solution. I'm not the best programmer though :).

Thanks!

Will

al-mcintyre commented 3 years ago

First attempt:

import re

iupac_dict = {
  'A':'A','C':'C','G':'G','T':'T','U':'U',
  'R':'AG','Y':'CT','S':'GC','W':'AT','K':'GT',
  'M':'AC','B':'CGT','D':'AGT','H':'ACT','V':'ACG','N':'ACGT'}

will='CRAANNNNNNNTGC'
willEG = 'CGAACCCCCCCTGC'
iupacWill = ''.join(['['+iupac_dict[c]+']' for c in will])
print(iupacWill)
ref_seq = willEG+'xxxxx'+willEG+'yyyyy'+willEG+'zzzzz'
print(ref_seq)
ref_seq = re.sub(iupacWill,'M',ref_seq)
print(ref_seq)

This seems to work from my quick test. In the place of ref_seq.replace(motif,meth_motif), you would instead adapt use re.sub after converting your motif to the appropriate format. If you don't manage, I will try to integrate this and update tomorrow or Friday.

wshropshire commented 3 years ago

Okay great, just running the code above, I think this regex solution should work as well with the IUPAC dictionary substituting for the current dict in extract_contexts.py. I'll try to integrate this new dict and modify the methylate_motifs function in extract_contexts.py tomorrow and run on my data. I'll let you know if I'm not able to get it to work with your modified script on the server and defer to your superior programming competencies. Little puzzled why the script seems to be picking up an extra contig from my reference, but I'll also see if I can figure that out. I'll have a lot of time tomorrow morning to look at this after sequencing the past three days.

Thanks again for helping out! This seems to be the exact modeling we need for our experiment.

Will

wshropshire commented 3 years ago

Hey sorry just getting to this: so should I include the iupac_dict in the else statement for the def methylate_motifs(ref_seq,motif,meth_base,meth_position=None): function and then adapt meth_motif based on the 'iupacWill' loop statement? Sorry if this is straight forward, but didn't get a chance to get to this until today and my server login is down and my institutions' IT service is not available since it's Easter Sunday. Hoping to test this tomorrow with my data, but I think that would be the only script changes necessary correct?

Will

al-mcintyre commented 3 years ago

Hi Will, A slightly different strategy for compatibility with the current model structure; try replacing the first portion of extract_contexts with the following:

base_comps = {'A':'T','C':'G','T':'A','G':'C','N':'N','M':'M'}
iupac_dict = {
  'A':'A','C':'C','G':'G','T':'T','U':'U',
  'R':'AG','Y':'CT','S':'GC','W':'AT','K':'GT',
  'M':'AC','B':'CGT','D':'AGT','H':'ACT','V':'ACG','N':'ACGT'}

#@profile
def comp(seq,base_comps=base_comps):
   return ''.join([base_comps[nt] for nt in list(seq)])

#@profile
def revcomp(seq,rev=True):
   if not rev:
      return seq
   else:
      return ''.join(list(comp(seq))[::-1])

#@profile
def strand(rev):
   if rev:
      return '-'
   else:
      return '+'

#find positions of motifs (eg. CG bases) in reference sequence and change to M
#@profile
def methylate_motifs(ref_seq,motif,meth_base,meth_position=None):
   iupacMotif = ''.join(['['+iupac_dict[c]+']' for c in motif])
   iu = re.compile(iupacMotif)
   if not meth_position:
      meth_positions = [m.start() for m in re.finditer(meth_base,motif)]
   else:
      meth_positions = [meth_position]
   positions = [match.start()+mpos for match in iu.finditer(ref_seq) for mpos in meth_positions]
   meth_seq = methylate_positions(ref_seq,positions,meth_base)
   return meth_seq
wshropshire commented 3 years ago

Hello,

Thank you so much for helping me. I modified the code to include the new dictionary and the new methylate motifs function (I definitely wasn't close to getting the correct regex solution to substitute the motifs). Nevertheless, after running this workflow from the beginning and seeing if changing my reference fasta from a multi-line to single-line fasta file, I still get the error message:

`python3 /data/opt/programs/etc/mCaller/mCaller-1.0.3/mCaller.py -m CRAANNNNNNNTGC -r 6180WT_flye_assembly_oneLine.fasta -d /data/opt/programs/etc/mCaller/mCaller-1.0.3/r94_model_NN_6_m6A.pkl -e barcode04_6180WT.eventalign.tsv -f barcode04_6180WT.fastq -b A

1 contigs 1 threads Error: could not find sequence for reference contig contig Traceback (most recent call last): File "/data/opt/programs/etc/mCaller/mCaller-1.0.3/mCaller.py", line 187, in main() File "/data/opt/programs/etc/mCaller/mCaller-1.0.3/mCaller.py", line 184, in main args.train,modelfile,args.skip_thresh,args.qual_thresh,args.classifier,training_tsv,args.plot_training) File "/data/opt/programs/etc/mCaller/mCaller-1.0.3/mCaller.py", line 58, in distribute_threads extract_features(tsvname,refname,read2qual,nvariables,skip_thresh,qual_thresh,modelfile,classifier,0,endline=bytesize,train=train,pos_label=training_pos_dict,base=base,motif=motif,positions_list=positions_list) File "/data/opt/programs/etc/mCaller/mCaller-1.0.3/extract_contexts.py", line 161, in extract_features meth_fwd,meth_rev = find_and_methylate(fasta_input,chrom,base,motif,positions_list) File "/data/opt/programs/etc/mCaller/mCaller-1.0.3/extract_contexts.py", line 85, in find_and_methylate meth_fwd,meth_rev = methylate_references(str(ref.seq).upper(),base,motif=motif,positions=positions_list,contig=contigname) File "/data/opt/programs/etc/mCaller/mCaller-1.0.3/extract_contexts.py", line 68, in methylate_references meth_rev = methylate_motifs(ref_seq,revcomp(motif),base_comps[base]) File "/data/opt/programs/etc/mCaller/mCaller-1.0.3/extract_contexts.py", line 26, in revcomp return ''.join(list(comp(seq))[::-1]) File "/data/opt/programs/etc/mCaller/mCaller-1.0.3/extract_contexts.py", line 19, in comp return ''.join([base_comps[nt] for nt in list(seq)]) File "/data/opt/programs/etc/mCaller/mCaller-1.0.3/extract_contexts.py", line 19, in return ''.join([base_comps[nt] for nt in list(seq)]) KeyError: 'R'`

Really not sure why it identifies multiple contigs. There is only a contig_1 file. I can show you the fasta/fastq files if that would help...

Thanks!

Will

al-mcintyre commented 3 years ago

Oops this is an error with mapping the base 'R' when converting the motif to its reverse complement. Try this: base_comps = {'A':'T','C':'G','T':'A','G':'C','N':'N','M':'M','R':'Y','Y':'R','S':'S','W':'W','K':'M','M':'K','B':'V','D':'H','H':'D','V':'B','N':'N'}

al-mcintyre commented 3 years ago

The "contig" issue is still strange. Can you check the eventalign tsv to see if any events are mapped to "contig". Most likely error would be at the end if the file is truncated or something, so try tail.

wshropshire commented 3 years ago

Adding the additional degenerate bases for the reverse complement fixed that error. I looked at the beginning and end of the eventalign file and it looks fine:

Screen Shot 2021-04-10 at 12 28 22 PM

I am getting standard output though that looks promising!

`python3 /data/opt/programs/etc/mCaller/mCaller-1.0.3/mCaller.py -m CRAANNNNNNNTGC -r 6180WT_flye_assembly_oneLine.fasta -d /data/opt/programs/etc/mCaller/mCaller-1.0.3/r94_model_NN_6_m6A.pkl -e barcode04_6180WT.eventalign.tsv -f barcode04_6180WT.fastq -b A

1 contigs 1 threads Error: could not find sequence for reference contig contig 0fcb2c0a-8b2f-464d-9d4e-cf735e2b2d88 16175 GTCAKKTCGAA 5.095000000000001,2.91,1.3633333333333333,0.67,4.43,6.085000000000001,23.362945556640625 - 0fcb2c0a-8b2f-464d-9d4e-cf735e2b2d88 True 0fcb2c0a-8b2f-464d-9d4e-cf735e2b2d88 True 22818 TTGACG MTGACG CGTCAA 0 1 0`

Would be pretty surprised though if there is only one 'CRAANNNNNNNTGC' motif in this genome...

wshropshire commented 3 years ago

Oh, is it possibly because it's recognizing the header from the eventalign file?

al-mcintyre commented 3 years ago

try without the header! That could be it.

wshropshire commented 3 years ago

So small victories as the header in the eventalign file was what was responsible for the error message. However, still puzzled why (1) there is only one motif that is being recognized and (2) the motif doesn't seem to be a match in terms of on the flanking regions of the 14-based k-mer, nor does it seem to be the right size (i.e. GTCAKKTCGAA /= CRAANNNNNNNTGC nor does it's reverse complement). Could be a naive question, but does the nanopolish eventsalign have 14-based k-mers that it detects? Seems like all the model events are 6-mers. Still would think that since you were able to detect these type 1 motifs using a positions file for your paper, that you somehow were able to extract the probability info based on the read alignment with the eventsalign document. I'm going to post a related question on the nanopolish website to see if I get any feedback.

Thanks again for the continued help!

Will

al-mcintyre commented 3 years ago

1) use of 6-mers wouldn't matter since the motif positions are detected in the genome not in the eventalign file. 2) this is a very weird issue. Does your genome have ambiguous bases (like K)? Can you send me your modified extract_contexts.py file to take a look? What is the length of the output file (i.e. how many reads for the one position it recognizes)?

al-mcintyre commented 3 years ago

Ah, I see the issue :) this is on the negative strand and I've introduced an error with the new base_comps dict. Take out 'M':'K' in that dict (I will have to fix how I'm handling this eventually, but it's not relevant to your motif anyway) and it should show that motif as GTCAMMTCGAA, where the 'M's represent sites of interest (the adenines in your motif). CAMMTCGAA = CRAANNNNN, the first part of your motif. You can verify this in your genome fasta. It's still strange that it's only identifying one instance of this motif in your genome, though. Can you also send me your genome file to test? Maybe there's an issue with the regex. (email abm237 at cornell.edu if you don't want to post here)

wshropshire commented 3 years ago

Is this an issue with 'M' being set as the motif variable? The draft genome I created from is Flye and doesn't have any ambiguous bases and on stdout it looks like only one read is mapping to this motif and there isn't a *diffs.6 file that is being output (which again is weird). I took out the 'M':'K' key/value pair and it looks like it is now recognizing more motifs! I'm going to be testing this on another motif once I figure out how to abstract these positions and analyze them, but looks like everything is working with at least this particular fix. Thank you so much!

Will

wshropshire commented 3 years ago

Oh sorry, but to make sure I don't run into any other downstream errors, do I also need to remove 'K':'M' from the base_comps dict? Feel as if the IUPAC code for 'M' needs to be compatible with the Motif variable, but also could be entirely off-base here :).

al-mcintyre commented 3 years ago

That should be fine, since neither your motif nor your genome includes K and that conversion will never be called. But this is not a good fix for mCaller overall and something I need to take into consideration if I want to incorporate motif detection in the next version, or if you ever want to use your version for detection of motifs with K. Yes, this is all an issue with me setting sites to predict to M.

wshropshire commented 3 years ago

Okay great. It also looks like that for the other set of isolates I'm testing to see if the type 1 methylase was knocked out, it recognizes the same motif so should be straight forward. Would you like me to email you the modified extract_contexts.py script or maybe just fork this repo? Can do whatever works best for you!

Will

al-mcintyre commented 3 years ago

Email please. Thanks!