TreesLab / NCLscan

We have developed a new pipeline, NCLscan, which is rather advantageous in the identification of both intragenic and intergenic "non-co-linear" (NCL) transcripts (fusion, trans-splicing, and circular RNA) from paired-end RNA-seq data.
MIT License

novoalign fails to create index #22

Closed BarryDigby closed 4 years ago

BarryDigby commented 4 years ago

Hi there,

create_reference.py fails with the following error:

Error: Invalid NA code in /data/bdigby/NCLscan/reference/AllRef.fa at line 51696593.

Inspecting the line in question:

sed -n '51696593p' ../reference/AllRef.fa
>ENSP00000493376.2|ENST00000641515.2_2|ENSG00000186092.6_4|OTTHUMG00000001094.4_4|OTTHUMT00000003223.4_2|OR4F5-202|OR4F5|326|

This seems to be a novoalign problem; BWA creates the index without issue.

Please find attached the error log.

nclscan.err.txt

colinhercus commented 4 years ago

Hi Barry,

You need to look at the next line of the FASTA file, as the Novoindex line count is zero-based.

Colin
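The off-by-one can be checked directly: `sed` numbers lines from 1, so a zero-based line number N reported in the error corresponds to `sed` line N+1. A minimal sketch, assuming the reported number is zero-based as Colin describes; the helper name `show_zero_based_line` is hypothetical and not part of NCLscan or Novoalign.

```shell
# Hypothetical helper: print the line that a zero-based line number refers to.
# $1 = file path, $2 = zero-based line number (as reported in the error).
show_zero_based_line() {
    # sed addresses lines starting at 1, so add one to the zero-based index.
    sed -n "$(( $2 + 1 ))p" "$1"
}

# Usage with the path and line number from the error message above:
# show_zero_based_line /data/bdigby/NCLscan/reference/AllRef.fa 51696593
```

With the file from this issue, that should print the first sequence line following the `>ENSP00000493376.2|...` header rather than the header itself.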

chiangtw commented 4 years ago

Hi, Barry,

I think you might have used the wrong file :sweat_smile:. You should use "gencode.v33lift37.pc_transcripts.fa.gz", not "gencode.v33lift37.pc_translations.fa".

tw

chiangtw commented 4 years ago

@colinhercus Thanks for the information!

colinhercus commented 4 years ago

@BarryDigby At https://www.gencodegenes.org/human/release_33lift37.html

zcat gencode.v33lift37.pc_translations.fa.gz | head -5

>ENSP00000493376.2|ENST00000641515.2_2|ENSG00000186092.6_4|OTTHUMG00000001094.4_4|OTTHUMT00000003223.4_2|OR4F5-202|OR4F5|326|
MKKVTAEAISWNESTSETNNSMVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIV
ITVVSDSHLHSPMYFLLANLSLIDLSLSSVTAPKMITDFFSQRKVISFKGCLVQIFLLHF
FGGSEMVILIAMGFDRYIAICKPLHYTTIMCGNACVGIMAVTWGIGFLHSVSQLAFAVHL

You should be using ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/GRCh37_mapping/gencode.v33lift37.pc_transcripts.fa.gz
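The underlying problem is that a protein FASTA contains residue letters that are not nucleotide codes. A rough pre-check that can be run before indexing is sketched below. It assumes the reference should contain only IUPAC nucleotide codes; the function name `first_non_nucleotide_line` and the accepted character set are assumptions for illustration, not novoindex's actual validation rules.

```shell
# Hypothetical pre-check: read FASTA on stdin and report the first sequence
# line that contains a character outside the IUPAC nucleotide codes.
# Prints "<1-based line number>: <line>" and returns 1 if one is found;
# returns 0 if every sequence line looks like nucleotides.
first_non_nucleotide_line() {
    awk '
        /^>/ { next }                                # skip header lines
        /[^ACGTUNRYSWKMBDHVacgtunryswkmbdhv]/ {      # any non-nucleotide letter
            print NR ": " $0
            found = 1
            exit                                     # jumps to the END block
        }
        END { exit found }
    '
}

# Usage (file names taken from this thread):
# zcat gencode.v33lift37.pc_translations.fa.gz | first_non_nucleotide_line
# zcat gencode.v33lift37.pc_transcripts.fa.gz | first_non_nucleotide_line
```

Note that the printed line number is 1-based (awk's `NR`), unlike the zero-based count mentioned earlier in the thread.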

BarryDigby commented 4 years ago

Thank god for that, much less work for all involved ;)

thanks again gents