Methylation analysis of Nanopore data

ifalconi20 commented 1 year ago

I have the sequencing data in Nanopore of a small amplicon (230 bp) of the HLA-C gene. One of the challenges I face is the low quality of the sequences obtained, as well as the fragmentation of the sequences (most have the size of the amplicon, but there are others that are smaller, some up to 100 bp and some larger, up to 700-1000 bp). This is because I do not have a specific kit for short sequences, so I had to use a kit designed for long sequences, which caused fragmentation of my sequences due to the action of the transposase.

My goal is to get the most out of the reads obtained, so I performed three sequences using the same method to increase coverage. First, I used the Porechop tool to remove barcodes designed for Nanopore and possible chimeras. Then, I employed another sequence trimming tool called Filtlong to maintain a length range between 44 and 230 bp in order to optimize the mapping process.

Although I am still unsure whether the mapping should be directional or non-directional, I initially performed the mapping in a directional manner. However, I experienced low mapping efficiency, ranging from 7% to 15% in all my samples. After researching in the official guide and consulting some discussions in specialized forums, I decided to use the following command to improve the results:

bismark --score_min L,0,-10 --phred33 -N 1 -L 10 --genome (Referencia) (archivo)

In the hope of increasing mapping efficiency, I adjusted the parameters of the Bismark command. I set a minimum score value (score_min) at L,0,-21, used a phred33 sequence quality format, and allowed up to 1 error in alignment (-N 1). In addition, I set a minimum alignment length threshold at 10 (-L 10). Finally, I indicated the reference genome that I used in the mapping process. With the adjustments made to the Bismark command, I managed to improve the mapping efficiency to a range between 48% and 62%. However, I would like to confirm if the implementation I have carried out is adequate or if another approach or strategy for data handling is needed.

I have also faced problems where some sequences try to align to the end instead of the start of the sequence. Despite adding "NN" to both the beginning and end of the reference sequence, it has not completely solved the problem. It is important to note that this situation occurs in a relatively low number of entries, around 40 or 60 compared to the total number of reads, so the difference is not significant.

FelixKrueger commented 1 year ago

Hi @ifalconi20

Ahh, you've got Nanopore data. May I ask if the data was bisulfite converted at all, or using a similar method such as EM-seq? Please note that Bismark doesn't just do a methylation analysis of sequencing data, but it is explicitly designed to work with bisulfite-seq data.

Bismark supports the use of minimap2 to align Nanopore reads, I am sure this would be a much better option than tweaking the options for Bowtie2 alignments, see here: https://github.com/FelixKrueger/Bismark/releases/tag/0.24.0

As an example, for a 200bp long read with a --score_min L,0,-21 would allow a penalty score of -4200, or 700 mismatches (!) and or indels, which seems to be somewhat overly lenient. :)

ifalconi20 commented 1 year ago

Like RRBS, I did a bisulfite treatment to the extracted dna from saliva and amplified a region with a large proportion of GC within the gene, and prepared the library with Nanopore and sequenced.

How should I take directional or non-directional mapping?

Sorry if some things don't seem correct to what I'm trying to do, I'm new to using these tools.

Thanks for the minimap2 recommendation, I will apply it.

FelixKrueger commented 1 year ago

I'm afraid I can't tell you in advance which type of mapping you will have to do as this depends entirely on the exact details used to generate the library. This should become obvious once you have performed the alignment in non-directional mode though, e.g, with only a limited number of reads (e.g.: -u 10000 (and don't forget to use --minimap...)).

For amplicon libraries you might get only OT (or (OB) alignments, sometimes it can be a combination of such as OT and CTOT, or the like. Happy to take a look at the mapping report once it is in.

ifalconi20 commented 1 year ago

Now what I did was to prepare the reference genome with minimap2.

Bismark_genome_preparation --minimap2 --verbose /Reference

Then I did the mapping with minimap2, as the Nanopore option is default then I did not specify that.

Bismark --minimap2 --genome /Reference Barcode01.fastq.gz

This was the output for directional

`Minimap2 is assumed to be in PATH, using: 'minimap2'
minimap2 seems to be working fine (tested command 'minimap2 --version' [2.24-r1122])
Output format is BAM (default)
Alignments will be written out in BAM format. Samtools found here: '/usr/local/bin/samtools'
Reference genome folder provided is /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/    (absolute path is '/home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/)'
FastQ format assumed (by default)

Input files to be analysed (in current folder '/home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/02. filtlong/270'):
Barcode01_filtlong.fastq.gz
Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)
Setting parallelization to single-threaded (default)

Using a maximum length cutoff of 10000 bp
Using preset for ONT mapping (-x map-ont)

Summary of all aligner options:    -a --MD --secondary=no -t 2 -x map-ont -K 250K
Current working directory is: /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/02. filtlong/270

Now reading in and storing sequence information of the genome specified in: /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/

chr HLA-C (223 bp)

Single-core mode: setting pid to 1

Single-end alignments will be performed
=======================================

Input file is in FastQ format
Writing a C -> T converted version of the input file Barcode01_filtlong.fastq.gz to Barcode01_filtlong.fastq.gz_C_to_T.fastq

Created C -> T converted version of the FastQ file Barcode01_filtlong.fastq.gz (14791 sequences in total)

Input file is Barcode01_filtlong.fastq.gz_C_to_T.fastq (FastQ)

Now running 2 instances of minimap2 against the bisulfite genome of /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/ with the specified options: -a --MD --secondary=no -t 2 -x map-ont -K 250K

Now starting >> minimap2 << for CTreadCTgenome (reading in sequences from Barcode01_filtlong.fastq.gz_C_to_T.fastq with options -a --MD --secondary=no -t 2 -x map-ont -K 250K)
Using minimap2 index: /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/CT_conversion/BS_CT.mmi

Minimap2 command line:
minimap2 -a --MD --secondary=no -t 2 -x map-ont -K 250K /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/CT_conversion/BS_CT.mmi Barcode01_filtlong.fastq.gz_C_to_T.fastq
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[M::main::0.001*1.64] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.001*1.59] mid_occ = 10
[M::mm_idx_stat] kmer size: 20; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.001*1.58] distinct minimizers: 36 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.194; total length: 223
Found first alignment:    c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    0    HLA-C_CT_converted    134    39    74M30S    *    0    0    TGGATTGTTGTGGATATTGTGGTTTAGATTATTTAGTGTAAGTTGGAGGTGGTTTGTGTGGTGGAGTAGTTGAGGATTTGTGTTATTTGTATTTTGTTTAGATG    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    NM:i:0    ms:i:148    AS:i:148    nn:i:0    tp:A:P    cm:i:8s1:i:63    s2:i:0    de:f:0    MD:Z:74    rl:i:0
storing c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01 and
c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    0    HLA-C_CT_converted    134    39    74M30S    *    0    0TGGATTGTTGTGGATATTGTGGTTTAGATTATTTAGTGTAAGTTGGAGGTGGTTTGTGTGGTGGAGTAGTTGAGGATTTGTGTTATTTGTATTTTGTTTAGATG    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    NM:i:0    ms:i:148    AS:i:148    nn:i:0    tp:A:P    cm:i:8    s1:i:63    s2:i:0    de:f:0    MD:Z:74rl:i:0
Now starting >> minimap2 << for CTreadGAgenome (reading in sequences from Barcode01_filtlong.fastq.gz_C_to_T.fastq with options -a --MD --secondary=no -t 2 -x map-ont -K 250K)
Using minimap2 index: /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/GA_conversion/BS_GA.mmi

Minimap2 command line:
minimap2 -a --MD --secondary=no -t 2 -x map-ont -K 250K /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/GA_conversion/BS_GA.mmi Barcode01_filtlong.fastq.gz_C_to_T.fastq
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[M::main::0.001*1.56] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.001*1.54] mid_occ = 10
[M::mm_idx_stat] kmer size: 20; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.001*1.53] distinct minimizers: 37 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.027; total length: 223
Found first alignment:    c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    4    *    0    0    *    *    00    TGGATTGTTGTGGATATTGTGGTTTAGATTATTTAGTGTAAGTTGGAGGTGGTTTGTGTGGTGGAGTAGTTGAGGATTTGTGTTATTTGTATTTTGTTTAGATG    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    rl:i:0
storing c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01 and
c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    4    *    0    0    *    *    0    0    TGGATTGTTGTGGATATTGTGGTTTAGATTATTTAGTGTAAGTTGGAGGTGGTTTGTGTGGTGGAGTAGTTGAGGATTTGTGTTATTTGTATTTTGTTTAGATG    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    rl:i:0

>>> Writing bisulfite mapping results to Barcode01_filtlong_bismark_mm2.bam <<<

Reading in the sequence file Barcode01_filtlong.fastq.gz
[M::worker_pipeline::1.071*0.02] mapped 1916 sequences
[M::worker_pipeline::1.063*0.01] mapped 1916 sequences
[M::worker_pipeline::1.124*0.02] mapped 1916 sequences
[M::worker_pipeline::1.132*0.03] mapped 1916 sequences
[M::worker_pipeline::1.186*0.03] mapped 1881 sequences
[M::worker_pipeline::1.195*0.04] mapped 1881 sequences
[M::worker_pipeline::1.252*0.03] mapped 1893 sequences
[M::worker_pipeline::1.261*0.05] mapped 1893 sequences
[M::worker_pipeline::1.323*0.06] mapped 1905 sequences
[M::worker_pipeline::1.315*0.04] mapped 1905 sequences

`

This was the output for non-directional

`Minimap2 is assumed to be in PATH, using: 'minimap2'
minimap2 seems to be working fine (tested command 'minimap2 --version' [2.24-r1122])
Output format is BAM (default)
Alignments will be written out in BAM format. Samtools found here: '/usr/local/bin/samtools'
Reference genome folder provided is /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/    (absolute path is '/home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/)'
FastQ format assumed (by default)

Input files to be analysed (in current folder '/home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/02. filtlong/270/non'):
Barcode01_filtlong.fastq.gz
Library was specified to be not strand-specific (non-directional), therefore alignments to all four possible bisulfite strands (OT, CTOT, OB and CTOB) will be reported
Setting parallelization to single-threaded (default)

Using a maximum length cutoff of 10000 bp
Using preset for ONT mapping (-x map-ont)

Summary of all aligner options:    -a --MD --secondary=no -t 2 -x map-ont -K 250K
Current working directory is: /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/02. filtlong/270/non

Now reading in and storing sequence information of the genome specified in: /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/

chr HLA-C (223 bp)

Single-core mode: setting pid to 1

Single-end alignments will be performed
=======================================

Input file is in FastQ format
Writing a C -> T converted version of the input file Barcode01_filtlong.fastq.gz to Barcode01_filtlong.fastq.gz_C_to_T.fastq
Writing a G -> A converted version of the input file Barcode01_filtlong.fastq.gz to Barcode01_filtlong.fastq.gz_G_to_A.fastq

Created C -> T as well as G -> A converted versions of the FastQ file Barcode01_filtlong.fastq.gz (14791 sequences in total)

Input files are Barcode01_filtlong.fastq.gz_C_to_T.fastq and Barcode01_filtlong.fastq.gz_G_to_A.fastq (FastQ)

Now running 4 individual instances of minimap2 against the bisulfite genome of /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/ with the specified options: -a --MD --secondary=no -t 2 -x map-ont -K 250K

Now starting >> minimap2 << for CTreadCTgenome (reading in sequences from Barcode01_filtlong.fastq.gz_C_to_T.fastq with options -a --MD --secondary=no -t 2 -x map-ont -K 250K)
Using minimap2 index: /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/CT_conversion/BS_CT.mmi

Minimap2 command line:
minimap2 -a --MD --secondary=no -t 2 -x map-ont -K 250K /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/CT_conversion/BS_CT.mmi Barcode01_filtlong.fastq.gz_C_to_T.fastq
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[M::main::0.001*1.62] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.001*1.60] mid_occ = 10
[M::mm_idx_stat] kmer size: 20; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.001*1.58] distinct minimizers: 36 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.194; total length: 223
Found first alignment:    c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    0    HLA-C_CT_converted    134    39    74M30S    *    0    0    TGGATTGTTGTGGATATTGTGGTTTAGATTATTTAGTGTAAGTTGGAGGTGGTTTGTGTGGTGGAGTAGTTGAGGATTTGTGTTATTTGTATTTTGTTTAGATG    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    NM:i:0    ms:i:148    AS:i:148    nn:i:0    tp:A:P    cm:i:8s1:i:63    s2:i:0    de:f:0    MD:Z:74    rl:i:0
storing c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01 and
c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    0    HLA-C_CT_converted    134    39    74M30S    *    0    0TGGATTGTTGTGGATATTGTGGTTTAGATTATTTAGTGTAAGTTGGAGGTGGTTTGTGTGGTGGAGTAGTTGAGGATTTGTGTTATTTGTATTTTGTTTAGATG    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    NM:i:0    ms:i:148    AS:i:148    nn:i:0    tp:A:P    cm:i:8    s1:i:63    s2:i:0    de:f:0    MD:Z:74rl:i:0
Now starting >> minimap2 << for CTreadGAgenome (reading in sequences from Barcode01_filtlong.fastq.gz_C_to_T.fastq with options -a --MD --secondary=no -t 2 -x map-ont -K 250K)
Using minimap2 index: /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/GA_conversion/BS_GA.mmi

Minimap2 command line:
minimap2 -a --MD --secondary=no -t 2 -x map-ont -K 250K /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/GA_conversion/BS_GA.mmi Barcode01_filtlong.fastq.gz_C_to_T.fastq
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[M::main::0.001*1.64] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.001*1.60] mid_occ = 10
[M::mm_idx_stat] kmer size: 20; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.001*1.59] distinct minimizers: 37 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.027; total length: 223
Found first alignment:    c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    4    *    0    0    *    *    00    TGGATTGTTGTGGATATTGTGGTTTAGATTATTTAGTGTAAGTTGGAGGTGGTTTGTGTGGTGGAGTAGTTGAGGATTTGTGTTATTTGTATTTTGTTTAGATG    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    rl:i:0
storing c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01 and
c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    4    *    0    0    *    *    0    0    TGGATTGTTGTGGATATTGTGGTTTAGATTATTTAGTGTAAGTTGGAGGTGGTTTGTGTGGTGGAGTAGTTGAGGATTTGTGTTATTTGTATTTTGTTTAGATG    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    rl:i:0
Now starting >> minimap2 << for GAreadCTgenome (reading in sequences from Barcode01_filtlong.fastq.gz_G_to_A.fastq with options -a --MD --secondary=no -t 2 -x map-ont -K 250K)
Using minimap2 index: /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/CT_conversion/BS_CT.mmi

Minimap2 command line:
minimap2 -a --MD --secondary=no -t 2 -x map-ont -K 250K /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/CT_conversion/BS_CT.mmi Barcode01_filtlong.fastq.gz_G_to_A.fastq
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[M::main::0.002*1.54] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.002*1.53] mid_occ = 10
[M::mm_idx_stat] kmer size: 20; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.002*1.52] distinct minimizers: 36 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.194; total length: 223
Found first alignment:    c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    4    *    0    0    *    *    00    TAAATTATTATAAATATTATAATTTAAATTATTTAATATAAATTAAAAATAATTTATATAATAAAATAATTAAAAATTTACATCACTTATACTTCATTCAAACA    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    rl:i:0
storing c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01 and
c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    4    *    0    0    *    *    0    0    TAAATTATTATAAATATTATAATTTAAATTATTTAATATAAATTAAAAATAATTTATATAATAAAATAATTAAAAATTTACATCACTTATACTTCATTCAAACA    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    rl:i:0
Now starting >> minimap2 << for GAreadGAgenome (reading in sequences from Barcode01_filtlong.fastq.gz_G_to_A.fastq with options -a --MD --secondary=no -t 2 -x map-ont -K 250K)
Using minimap2 index: /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/GA_conversion/BS_GA.mmi

Minimap2 command line:
minimap2 -a --MD --secondary=no -t 2 -x map-ont -K 250K /home/ifalconi/Documents/Secuencias_HLA-C/Secuencia/Reference/Bisulfite_Genome/GA_conversion/BS_GA.mmi Barcode01_filtlong.fastq.gz_G_to_A.fastq
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[M::main::0.001*1.73] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.001*1.65] mid_occ = 10
[M::mm_idx_stat] kmer size: 20; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.001*1.63] distinct minimizers: 37 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.027; total length: 223
Found first alignment:    c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    4    *    0    0    *    *    00    TAAATTATTATAAATATTATAATTTAAATTATTTAATATAAATTAAAAATAATTTATATAATAAAATAATTAAAAATTTACATCACTTATACTTCATTCAAACA    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    rl:i:0
storing c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01 and
c9609726-8827-49b5-876d-0c15f1eb4d0d_runid=d8ebeca6e8635a47e09bc7283df833a641fd2581_read=21_ch=214_start_time=2021-10-09T20:36:12Z_flow_cell_id=FAO49264_protocol_group_id=HLA_C_Final_sample_id=no_sample_barcode=barcode01_barcode_alias=barcode01    4    *    0    0    *    *    0    0    TAAATTATTATAAATATTATAATTTAAATTATTTAATATAAATTAAAAATAATTTATATAATAAAATAATTAAAAATTTACATCACTTATACTTCATTCAAACA    ?1..../;<<>=<<==@A>>>>=>;;99::>@JEA?>>;:678AA>=20.0>>=@=<<D@;;=>>55;>@@@:1.-,*)'%$%%%%%''*,-//54499,+*'$    rl:i:0

>>> Writing bisulfite mapping results to Barcode01_filtlong_bismark_mm2.bam <<<

Reading in the sequence file Barcode01_filtlong.fastq.gz
[M::worker_pipeline::1.287*0.01] mapped 1916 sequences
[M::worker_pipeline::1.261*0.01] mapped 1916 sequences
[M::worker_pipeline::1.295*0.02] mapped 1916 sequences
[M::worker_pipeline::1.283*0.05] mapped 1916 sequences
[M::worker_pipeline::1.540*0.02] mapped 1916 sequences
[M::worker_pipeline::1.566*0.02] mapped 1916 sequences
[M::worker_pipeline::1.574*0.02] mapped 1916 sequences
[M::worker_pipeline::1.561*0.06] mapped 1916 sequences
[M::worker_pipeline::1.840*0.02] mapped 1881 sequences
[M::worker_pipeline::1.814*0.02] mapped 1881 sequences
[M::worker_pipeline::1.849*0.03] mapped 1881 sequences
[M::worker_pipeline::1.838*0.07] mapped 1881 sequences
[M::worker_pipeline::2.118*0.02] mapped 1893 sequences
[M::worker_pipeline::2.092*0.02] mapped 1893 sequences
[M::worker_pipeline::2.126*0.03] mapped 1893 sequences
[M::worker_pipeline::2.114*0.07] mapped 1893 sequences
[M::worker_pipeline::2.383*0.02] mapped 1905 sequences
[M::worker_pipeline::2.357*0.02] mapped 1905 sequences
[M::worker_pipeline::2.392*0.04] mapped 1905 sequences
[M::worker_pipeline::2.379*0.08] mapped 1905 sequences

Apparently it does not show me the mapping efficiency or methylation percentage.

When I do the methylation extraction and report, I get incomplete.

Bismark_methylation_extract --bedGraph --buffer_size 10G --cytosine_report --genome_folder /Reference Barcode01_bismark_mm2.bam

When I did the alignment with Bowtie2 I got 50.1% methylation percentage, with minimap2 I get 2.6%. Why is that? What details would I need to know to know what type of mapping? If you could review what I have analyzed it would be a great help to me to understand. I leave you the file with the report and everything.

https://drive.google.com/file/d/17uDS2TbVWL-HQUzTa-G6y84y_NBK14Ma/view?usp=sharing

ifalconi20 commented 1 year ago

Okay, I think the problem with it not showing up was with my machine, but then it worked better. To improve the mapping results, I applied more stringent criteria in quality control. It is important to mention that my sequences are between 44 and 220 base pairs, and I mapped them using the amplicon designed for the HLA-C gene as a reference. Then I tried using minimap2 to convert the reference genome and map it with the --minimap2/mm2_nanopore option, I did it in a--non-directional way and now I get a mapping efficiency between 70% and 90% in my samples. On the other hand, the methylation percentage is low, between 0.9% and 4.5%, after the deduplication step, as it is a normal amplification of a bisulfite treated amplicon.

I have been reading that minimap2 aligner is commonly used for Nanopore sequencing, but you do not mention its use for small amplicons. In your experience, do you think this methylation percentage is adequate for short reads? Do you think I am following the right path? What advice would you give me to supplement the analysis of this data?

FelixKrueger commented 1 year ago

Sorry for not replying earlier, it was quite bust - but I'm glad you are now getting better results!

To be honest I am not sure you should include a deduplication step, as you might see a lot of duplicate reads by design?

I can't really comment much on which methylation one would expect, as this is entire situation, cell type and context-dependent. We have seen anything from no methylation (e.g. Dnmt KOs, reprogramming), to 90% in sperm...

FelixKrueger commented 1 year ago

it might be a good idea to use a short-read sequencing platform to confirm the Nanopore results, e.g. a MiSeq run should easily be able to sequence reads of the length you have...

ifalconi20 commented 1 year ago

Thank you very much for the support :D The deduplication step was taken to eliminate the "noise" in the mapping, since in PCR it can generate over amplification, that's why I used it. I don't know if I am taking it right. Unfortunately I can't sequence again, but can I complement the information using the files obtained from the mapper to make another methylation call with another bioinformatic tool? What do you recommend?

FelixKrueger commented 1 year ago

If you take a look at slide 33 Dedduplication - consideration of this presentation you can see that whether or not deduplication is advisable depends very much on how the libraries were constructed. If you amplified a small target region (which generally is what the term 'amplicon' suggests) then you would expect every single fragment to start and end at the very same position (similar to the RRBS example in the slide linked above). In such a case, deduplication would discard a lot of information. The only reliable way of discriminating between PCR duplicates and genuinely different fragments coming from this locus you would have to include UMIs into different fragments.

I am not sure whether there is any additional useful information to be had from other tools (not even sure there are any out there...).

ifalconi20 commented 1 year ago

Thanks for the information c:

FelixKrueger / Bismark

Methylation analysis of Nanopore data #601