Magdoll / cDNA_Cupcake

Miscellaneous collection of Python and R scripts for processing Iso-Seq data
BSD 3-Clause Clear License
257 stars 102 forks source link

Calling variants using IsoPhase from Iso-Seq data #198

Open Yexin-Zhang opened 2 years ago

Yexin-Zhang commented 2 years ago

Hi,

I was trying to use IsoPhase do call SNPs from the Pacbio Isoseq data for a single sample. However, when I run the select_loci_to_phase.pywith all the files required, I was keeping getting 0 loci.

11      PacBio  transcript      63617628        63627350        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63617628        63619094        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63620065        63620182        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63622845        63623057        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63623737        63623954        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
11      PacBio  exon    63627298        63627350        .       -       .       transcript_id "PB.69.1"; gene_id "PB.69";
12      PacBio  transcript      20211007        20216155        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";
12      PacBio  exon    20211007        20211486        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";
12      PacBio  exon    20214033        20214764        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";
12      PacBio  exon    20215256        20216155        .       +       .       transcript_id "PB.76.2"; gene_id "PB.76";

It was very mysterious since when I used the exactly same reference genome and GFF file, but changed the FASTQ file to sample1..FL.RE.CL.hq.fasta(obtained after doing clustering, headers are formated as >transcript/0 full_length_coverage=4;length=5054), as well as changed the read_stat file into a new file converted using python <path_to>/phasing/utils/convert_group_to_read_stat_file.py, everything worked fine.

The new read_stat file is like:

id      pbid    is_fl   stat
transcript/304  PB.1.1  Y       unique
transcript/3111 PB.2.1  Y       unique
transcript/3713 PB.3.1  Y       unique
transcript/3605 PB.4.1  Y       unique
transcript/2086 PB.5.1  Y       unique
transcript/2546 PB.6.1  Y       unique
transcript/1800 PB.7.1  Y       unique

I was very confused, since I was hoping the phasing step would take the information of the full-length read sequences before clustering into isoform sequences. It would be very helpful if you could help with this problem. Thank you.

Monica

cpaisie commented 6 months ago

@Yexin-Zhang did you use the --fq flag in the command when you ran select_loci_to_phase.py? I had the same problem as you, but when I added --fq to the commands, everything ran as expected.