Closed jolespin closed 3 years ago
Hi @jolespin, pycoMeth expects the following fields in the nanopolish call-methylation output: "chromosome", "strand", "start", "end", "read_name", "log_lik_ratio", "log_lik_methylated", "log_lik_unmethylated". Looking at the log, it doesn't seem like you have a "strand" field. What version of Nanopolish are you using?
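For anyone who wants to check their own files before running pycoMeth, here is a minimal sketch that reads only the header line of a call-methylation TSV (plain or gzipped) and reports which of the fields listed above are absent. The function name `missing_fields` and the extension-based gzip detection are assumptions made for illustration; this is not pycoMeth's own validation code.

```python
import gzip

# Field names copied from the comment above.
EXPECTED_FIELDS = [
    "chromosome", "strand", "start", "end", "read_name",
    "log_lik_ratio", "log_lik_methylated", "log_lik_unmethylated",
]


def missing_fields(path):
    """Return the expected fields that are absent from the TSV header line."""
    # Assumption: gzipped files are recognised by their ".gz" extension.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        header = handle.readline().rstrip("\r\n").split("\t")
    return [field for field in EXPECTED_FIELDS if field not in header]
```

An empty list means every expected column name is present; a result such as a list containing only "strand" would match the situation described in this thread.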
Gzip format is not an issue. You can see that the files are actually all opened in gzip mode
I ended up using the optimized reimplementation:
https://github.com/hasindu2008/f5c
It’s supposed to be a drop-in replacement, but I’ll check to see which fields are not consistent.
@hasindu2008 do you have any insight?
I could parse the bam files to add this column manually maybe.
On Jan 23, 2021, at 6:52 AM, Adrien Leger notifications@github.com wrote:
Gzip format is not an issue. You can see that the files are actually all opened in gzip mode
@jolespin
When you call methylation, add --meth-out-version 2 to f5c to get the same fields as the latest nanopolish versions.
--meth-out-version INT
Format version of the output Methylation tsv file. If set to 1, the columns printed adhere to the output format of Nanopolish early versions. If set to 2, adhere to the latest nanopolish output format (that additionally includes the strand column and the header num_cpgs renamed to num_motifs) [default value: 1]
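As a rough illustration of wiring that flag into a wrapper, the sketch below only builds the argument list for an f5c call and leaves execution to the caller. The `--meth-out-version` option comes from the help text quoted above; the `-r`, `-b` and `-g` options (reads, BAM, reference) are written from memory of f5c's usage and should be treated as assumptions to verify against `f5c call-methylation --help`. The function name is hypothetical.

```python
import shlex


def f5c_call_methylation_cmd(reads, bam, genome, meth_out_version=2):
    """Build the argument list for an f5c methylation call (nothing is run here)."""
    return [
        "f5c", "call-methylation",
        "-r", str(reads),    # assumed: reads file
        "-b", str(bam),      # assumed: sorted alignments
        "-g", str(genome),   # assumed: reference genome
        # Flag taken from the help text quoted in this thread.
        "--meth-out-version", str(meth_out_version),
    ]


if __name__ == "__main__":
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(f5c_call_methylation_cmd("reads.fastq", "aln.bam", "ref.fa")))
```

The list can be passed to `subprocess.run(cmd, stdout=out_handle, check=True)` with `out_handle` opened on the target TSV, since the methylation calls are written to standard output.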
Hmm... I just ran f5c on hundreds of samples. Is there any way I can convert?
Alternatively, is there an earlier version of pycometh that handles earlier versions of Nanopolish?
@jolespin
A year ago, I ran pycoMeth (v0.3.4) on such samples by manually renaming num_cpgs to num_motifs in the f5c tsv outputs. I'm not sure whether the strand column is used in the latest pycoMeth for any calculations, but at that time it was not required, I guess.
After renaming they looked like:
chromosome start end read_name log_lik_ratio log_lik_methylated log_lik_unmethylated num_calling_strands num_motifs sequence
chr2 124998 124998 72d14342-4eb2-4d6e-8a8e-bd65ea179dc2 4.26 -111.34 -115.60 1 1 ACCTGCGAACA
Command was
pycoMeth CpG_Aggregate -i $input -f hg38noAlt.fa -t $each.tsv -b $each.bed --progress
@hasindu2008 Thanks, I got it to run with pycoMeth v0.3.4 after relabeling num_cpgs to num_motifs. I've changed my f5c wrapper to use --meth-out-version 2 by default to avoid this in the future.
Do you know of a quicker way to relabel the header? For my first batch of files I did it in a very naive way: loading everything into pandas (Python), relabeling, and then writing to disk.
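One lighter-weight option, sketched here under the assumption that only the first line needs to change, is to rewrite the header and stream the remaining bytes through untouched, so nothing is parsed or held in memory. The function name `rename_header_field` and the ".gz" extension check are illustrative choices, not part of pycoMeth or f5c.

```python
import gzip
import shutil


def _open(path, mode):
    # Assumption: gzipped files are recognised by their ".gz" extension.
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def rename_header_field(src, dst, old="num_cpgs", new="num_motifs"):
    """Copy src to dst, renaming one column name in the header line only."""
    old_b, new_b = old.encode(), new.encode()
    with _open(src, "rb") as fin, _open(dst, "wb") as fout:
        # Only the first line is split into fields and edited.
        header = fin.readline().rstrip(b"\r\n").split(b"\t")
        header = [new_b if name == old_b else name for name in header]
        fout.write(b"\t".join(header) + b"\n")
        # The data rows are copied through as raw bytes.
        shutil.copyfileobj(fin, fout)
```

Writing to a new path rather than editing in place keeps the original file intact until the output has been checked. Note that this only renames a column; it does not add the strand column.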
I tried running the following command:
But I got the following error. Is there a different syntax I need to use to include multiple files in the input?
Also, if I give a file that includes a list of file names, what should the header be?
Can these be gzipped? It says they can but I just want to double check.