Magdoll / Cogent

Coding Genome Reconstruction using Iso-Seq data
BSD 3-Clause Clear License
60 stars 17 forks source link

STARlong compatibility #3

Closed sjin09 closed 7 years ago

sjin09 commented 7 years ago

The STARlong has been recommended as a better choice for alignment of IsoSeq than GMAP on the wiki (https://github.com/PacificBiosciences/cDNA_primer/wiki/Bioinfx-study:-Optimizing-STAR-aligner-for-Iso-Seq-data).

The Cogent and cDNA_cupcake pipeline seems to require GMAP sam file as input. I have tried using the STARlong sam file as input towards cDNA_cupake and it seems to run fine except for the last line of the sam file where it produces the error shown below.

Traceback (most recent call last):
  File "/ruby/PacBio/Project/AK1/ASIAN_GENOME/SangJin/01.AK1/python/envs/anaCogent/bin/collapse_isoforms_by_sam.py", line 4, in <module>
    __import__('pkg_resources').run_script('cupcake==2.1', 'collapse_isoforms_by_sam.py')
  File "/ruby/PacBio/Project/AK1/ASIAN_GENOME/SangJin/01.AK1/python/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
  File "/ruby/PacBio/Project/AK1/ASIAN_GENOME/SangJin/01.AK1/python/envs/anaCogent/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
  File "/mnt/ruby/PacBio/Project/AK1/ASIAN_GENOME/SangJin/01.AK1/python/envs/anaCogent/lib/python2.7/site-packages/cupcake-2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 242, in <module>
    main(args)
  File "/mnt/ruby/PacBio/Project/AK1/ASIAN_GENOME/SangJin/01.AK1/python/envs/anaCogent/lib/python2.7/site-packages/cupcake-2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 203, in main
    collapse_fuzzy_junctions(f_good.name, f_txt.name, args.allow_extra_5exon, internal_fuzzy_max_dist=args.max_fuzzy_junction)
  File "/mnt/ruby/PacBio/Project/AK1/ASIAN_GENOME/SangJin/01.AK1/python/envs/anaCogent/lib/python2.7/site-packages/cupcake-2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 149, in collapse_fuzzy_junctions
    _size = get_fl_from_id(group_info[pbid])
  File "/mnt/ruby/PacBio/Project/AK1/ASIAN_GENOME/SangJin/01.AK1/python/envs/anaCogent/lib/python2.7/site-packages/cupcake-2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 92, in get_fl_from_id
    return sum(int(_id.split('/')[1].split('p')[0][1:]) for _id in members)
  File "/mnt/ruby/PacBio/Project/AK1/ASIAN_GENOME/SangJin/01.AK1/python/envs/anaCogent/lib/python2.7/site-packages/cupcake-2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/collapse_isoforms_by_sam.py", line 92, in <genexpr>
    return sum(int(_id.split('/')[1].split('p')[0][1:]) for _id in members)
ValueError: invalid literal for int() with base 10: ''

Would there be any significant differences in using STARlong or GMAP for the Cogent and cDNA_cupcake pipeline?

Best, Jin

Magdoll commented 7 years ago

Hi,

Interesting. I would think STAR's SAM output should be the same as GMAP's and it would work. The error message displayed seemed to suggest the last line of SAM file may be corrupt or incomplete. What happens if you make a new SAM file that just has the last line deleted? If it works then, it's an indication your SAM file is incomplete and re-running STARlong again may be helpful.

If not, I can take a look at the SAM file to troubleshoot.

--Liz

sjin09 commented 7 years ago

Hello,

I have recreated and resorted the sam file. It has worked perfectly. I should have tried this first! It is good to know that cDNA_cupcake is compatible with STARlong!

Thank you so much for your help.

Best, Jin

gnetsanet commented 6 years ago

Hi,

I do not think the issue has to do with mapping or sam file not being sorted.

If you did the cluster part using tofu or pbtranscript.ice_partial, the fasta file - final.consensus.fasta - will have headers that resemble the one below.

>c262127/1/1651 isoform=c262127;full_length_coverage=1;isoform_length=1651
>c262128/1/512 isoform=c262128;full_length_coverage=1;isoform_length=512
>c262130/1/1174 isoform=c262130;full_length_coverage=1;isoform_length=1174
>c87379/1/2329 isoform=c87379;full_length_coverage=1;isoform_length=2329

The above ids (eg. c262127/1/1651 ) will make it to the sam files regardless of what you used for alignment, GMAP or STARlong.

Therefore, the following line in the collapse_fuzzy_junctions method in collapse_isoforms_by_sam.py will throw the above error since a 'p' literal is not in the ids

int(_id.split('/')[1].split('p')[0][1:])
>>> int('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: ''

Whereas, you would not see the issue if clustering as well as polishing is done via the SMRLink pipe and the input fasta is one like hq_isoforms.fasta with id that have 'p' for number of passes in them.

>i6_HQ_sample1888c8|c8/f2p0/6963 isoform=c8;full_length_coverage=2;non_full_length_coverage=0;isoform_length=6963
>i6_HQ_sample1888c8|c10/f2p0/6488 isoform=c10;full_length_coverage=2;non_full_length_coverage=0;isoform_length=6488
>i6_HQ_sample1888c8|c11/f2p0/7208 isoform=c11;full_length_coverage=2;non_full_length_coverage=0;isoform_length=7208
>i6_HQ_sample1888c8|c12/f2p0/6803 isoform=c12;full_length_coverage=2;non_full_length_coverage=0;isoform_length=6803
>i6_HQ_sample1888c8|c14/f2p0/6288 isoform=c14;full_length_coverage=2;non_full_length_coverage=0;isoform_length=6288 
Magdoll commented 6 years ago

Hi @gnetsanet ,

You are correct that having reads without the expected format, namely >c262127/1/1651 (an old format that is now deprecated) instead of >i0_HQ_sample1|c262127/f1p10/1651 (new, expected format) will cause collapse script to fail.

If you do have the old format, you will need to convert it to the new format by hacking the sequence IDs. This could be done with Python and BioPython library. Let me know if you need help with that.

--Liz