Calling variants using IsoPhase from Iso-Seq data

Hi,

I was trying to use IsoPhase do call SNPs from the Pacbio Isoseq data for a single sample. However, when I run the select_loci_to_phase.pywith all the files required, I was keeping getting 0 loci.

The GFF file is obtained after running SQANTI3 filtering

11      PacBio  transcript      63617628        63627350        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63617628        63619094        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63620065        63620182        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63622845        63623057        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63623737        63623954        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63627298        63627350        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
12      PacBio  transcript      20211007        20216155        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";
12      PacBio  exon    20211007        20211486        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";
12      PacBio  exon    20214033        20214764        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";
12      PacBio  exon    20215256        20216155        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";

The FASTQ file is obtained after doing CCS analyses - LIMA - Isoseq refine, but before clustering. The headers of each sequence are in the format @m64319e_211227_144417/3211320/ccs.

The read_stat file is the count information after collapse obtained by running get_abundance_post_collapse.py

id      is_fl   stat    pbid
m64319e_211227_144417/3211320/ccs       Y       unique  PB.1.1
m64319e_211227_144417/65145123/ccs      Y       unique  PB.1.1
m64319e_211227_144417/111935771/ccs     Y       unique  PB.2.1
m64319e_211227_144417/67240066/ccs      Y       unique  PB.2.1
m64319e_211227_144417/123208230/ccs     Y       unique  PB.3.1
m64319e_211227_144417/160236121/ccs     Y       unique  PB.3.1
m64319e_211227_144417/42860688/ccs      Y       unique  PB.3.1

I have also manually checked that the coverage at some loci is over 40, even some isoforms (e.g. P.B.78.13) are having support from more than 40 reads.

It was very mysterious since when I used the exactly same reference genome and GFF file, but changed the FASTQ file to sample1..FL.RE.CL.hq.fasta(obtained after doing clustering, headers are formated as >transcript/0 full_length_coverage=4;length=5054), as well as changed the read_stat file into a new file converted using python <path_to>/phasing/utils/convert_group_to_read_stat_file.py, everything worked fine.

The new read_stat file is like:

id      pbid    is_fl   stat
transcript/304  PB.1.1  Y       unique
transcript/3111 PB.2.1  Y       unique
transcript/3713 PB.3.1  Y       unique
transcript/3605 PB.4.1  Y       unique
transcript/2086 PB.5.1  Y       unique
transcript/2546 PB.6.1  Y       unique
transcript/1800 PB.7.1  Y       unique

I was very confused, since I was hoping the phasing step would take the information of the full-length read sequences before clustering into isoform sequences. It would be very helpful if you could help with this problem. Thank you.

Monica

Magdoll / cDNA_Cupcake

Calling variants using IsoPhase from Iso-Seq data #198