churchill-lab / kallisto-align

Export kallisto pseudo-alignments in a sparse binary matrix format
http://churchill-lab.github.io/kallisto-align/
GNU General Public License v2.0

kallisto index error #1

Open inti opened 5 years ago

inti commented 5 years ago

Hi, I am getting the following error. I built the index both with the kallisto provided with kallisto-align and with a kallisto installed via conda in a separate environment. In both cases I get the following error.

[kallisto-align] Creating my_sample.bin...
Error: incompatible indices. Found version 9, expected version 0

Many thanks in advance

kbchoi-jax commented 5 years ago

Hi Inti,

Try building the kallisto index with an older version of kallisto (https://pachterlab.github.io/kallisto/download), such as v0.42.1. They upgraded the index format to version 9 at some point, but our kallisto-align uses version 8. We will catch up with it eventually (I am considering merging it into alntools), but unfortunately not soon. Thanks for using kallisto-align.

KB
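
Since the error above is about an index version mismatch, one way to see which version a given index file carries is to read its header directly. The following is a minimal sketch, assuming the index file starts with an 8-byte little-endian unsigned integer holding the version; that layout is an assumption and is not confirmed anywhere in this thread. The function name `read_index_version` is hypothetical.

```python
import struct


def read_index_version(path):
    """Return the integer stored in the first 8 bytes of an index file.

    Assumption (not confirmed in this thread): the file begins with an
    unsigned 64-bit little-endian version number.
    """
    with open(path, "rb") as handle:
        header = handle.read(8)
    if len(header) < 8:
        raise ValueError("file is too short to contain a version header")
    (version,) = struct.unpack("<Q", header)
    return version
```

Calling `read_index_version("emase/SRR5125117/SRR5125117.k_idx")` on the index from this report would show whether the file on disk really carries version 9, independently of which binary reads it.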

inti commented 5 years ago

Hi, thanks for the response. That did not work; I got the same error as before. I built the index with kallisto v0.42.1 and then ran `kallisto-align`:

bash-4.2$ ~/app/kallisto-align/kallisto-align -i emase/SRR5125117/SRR5125117.k_idx -f fastq/SRR5125117_1.fastq.gz -b my_sample.bin
[kallisto-align] Creating my_sample.bin...
Error: incompatible indices. Found version 9, expected version 0
Rerun with index to regenerate

inti commented 5 years ago

Hi, I had already done it with kallisto (v0.42.1). Sorry I did not send the full commands I ran. Here is the output of building the index and then running kallisto-align. I also tried with kallisto v0.42 and got the same error.

bash-4.2$ ~/app/kallisto_linux-v0.42.1/kallisto index -i emase/SRR5125117/SRR5125117.k_idx emase/SRR5125117/SRR5125117.transcripts.fa

[build] loading fasta file emase/SRR5125117/SRR5125117.transcripts.fa
[build] k-mer length: 31
[build] warning: replaced 14045 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 2065 contigs and contains 185230 k-mers

bash-4.2$  ~/app/kallisto-align/kallisto-align -i emase/SRR5125117/SRR5125117.k_idx -f fastq/SRR5125117_1.fastq.gz -b my_sample.bin
[kallisto-align] Creating my_sample.bin...
Error: incompatible indices. Found version 9, expected version 0
Rerun with index to regenerate

kbchoi-jax commented 5 years ago

I am sorry, I meant that you should check whether kallisto quant works with the same input files on v0.42.1.

kallisto 0.42.1
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
    --single                  Quantify single-end reads
-l, --fragment-length=DOUBLE  Estimated average fragment length
                              (default: value is estimated from the input data)
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5

inti commented 5 years ago

I did try

bash-4.2$ ~/app/kallisto_linux-v0.42.1/kallisto index -i emase/SRR5125117/SRR5125117.transcripts.k_idx emase/SRR5125117/SRR5125117.transcripts.fa

[build] loading fasta file emase/SRR5125117/SRR5125117.transcripts.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 182 target sequences
[build] warning: replaced 78 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 84409 contigs and contains 21292011 k-mers

bash-4.2$  ~/app/kallisto_linux-v0.42.1/kallisto quant -i emase/SRR5125117/SRR5125117.transcripts.k_idx -o test fastq/SRR5125117_1.fastq.gz fastq/SRR5125117_2.fastq.gz

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 29493
[index] number of k-mers: 21292011
[index] number of equivalence classes: 61508
[quant] running in paired-end mode
[quant] will process pair 1: fastq/SRR5125117_1.fastq.gz
                             fastq/SRR5125117_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 0 reads, 0 reads pseudoaligned
[quant] estimated average fragment length: -nan
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1 rounds

It does not work (all transcripts have 0 counts) ... :/ It also fails with the same files and the newest version of kallisto (v0.44), and with the reference transcriptome Bombus_terrestris.Bter_1.0.cdna.all.fa. I have used kallisto recently, so this is odd and I did not expect it. Not sure what is going on ...

kbchoi-jax commented 5 years ago

Anyway, it seems that your issue is not caused by kallisto-align. Take a look at your transcripts.fa file.
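
For a first look at a transcripts FASTA file, a small summary of record counts and unexpected characters can help. This is a minimal sketch and not part of kallisto or kallisto-align; the function name `summarize_fasta` and the returned keys are hypothetical, and characters other than A, C, G, T, U are counted because the index build log warns about "non-ACGUT characters".

```python
def summarize_fasta(lines):
    """Summarize FASTA content given an iterable of text lines.

    Returns a dict with the number of records, the total number of
    sequence characters, the number of characters outside ACGTU
    (case-insensitive), and the IDs of records that have no sequence.
    """
    allowed = set("ACGTU")
    records = 0
    bases = 0
    other = 0
    empty_ids = []
    current_id = None
    current_len = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if current_id is not None and current_len == 0:
                empty_ids.append(current_id)
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            current_len = 0
            records += 1
        else:
            bases += len(line)
            current_len += len(line)
            other += sum(1 for ch in line.upper() if ch not in allowed)
    if current_id is not None and current_len == 0:
        empty_ids.append(current_id)
    return {"records": records, "bases": bases,
            "other": other, "empty_ids": empty_ids}
```

It can be applied to a file with `with open("transcripts.fa") as fh: print(summarize_fasta(fh))`; a record count or total length far from what is expected points at the FASTA file rather than at the tools reading it.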

inti commented 5 years ago

Sorry ... I had used kallisto recently, so I did not expect the issue to be there. Apologies again.

inti commented 5 years ago

1 Use prepare-emase to generate the diploid transcriptome, taking as input the SRR5125117.gtf and SRR5125117.fa generated with g2gtools

grep "_R" SRR5125117.gtf > SRR5125117.R.gtf
grep "_L" SRR5125117.gtf > SRR5125117.L.gtf
prepare-emase -G SRR5125117.fa,SRR5125117.fa -g SRR5125117.L.gtf,SRR5125117.R.gtf -s L,R -o test -m -x
sed -i "s/_R_R/_R/g" test/emase.pooled.transcripts.info
sed -i "s/_L_L/_L/g" test/emase.pooled.transcripts.info
sed -i "s/_R_R/_R/g" test/emase.pooled.transcripts.fa
sed -i "s/_L_L/_L/g" test/emase.pooled.transcripts.fa
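
The four sed commands above collapse a doubled haplotype suffix (`_L_L` or `_R_R`) into a single one. As a minimal sketch of the same substitution in Python, assuming only the suffix letters L and R are in use (the function name `collapse_doubled_suffix` is hypothetical):

```python
import re

# Matches "_L_L" or "_R_R"; the backreference forces both letters to agree.
_DOUBLED_SUFFIX = re.compile(r"_([LR])_\1")


def collapse_doubled_suffix(text):
    """Replace every "_L_L" with "_L" and every "_R_R" with "_R"."""
    return _DOUBLED_SUFFIX.sub(r"_\1", text)
```

Unlike `sed -i`, this returns a new string and does not modify any file in place, so it can be tried on a few IDs before touching emase.pooled.transcripts.info or emase.pooled.transcripts.fa.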

2 Build index

~/app/kallisto_linux-v0.42.1/kallisto index -i test/emase.pooled.transcripts.k_idx test/emase.pooled.transcripts.fa

[build] loading fasta file test/emase.pooled.transcripts.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 472 target sequences
[build] warning: replaced 78 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 44839 contigs and contains 20829208 k-mers

3 quant step

~/app/kallisto_linux-v0.42.1/kallisto quant -i test/emase.pooled.transcripts.k_idx -o test_kallisto ../../fastq/SRR5125122_1.fastq.gz ../../fastq/SRR5125122_2.fastq.gz

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 29496
[index] number of k-mers: 20829208
[index] number of equivalence classes: 56014
[quant] running in paired-end mode
[quant] will process pair 1: ../../fastq/SRR5125122_1.fastq.gz
                             ../../fastq/SRR5125122_2.fastq.gz
[quant] finding pseudoalignments for the reads ...
 done
[quant] processed 23476745 reads, 12434421 reads pseudoaligned
[quant] estimated average fragment length: 156.926
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1088 rounds
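
The key line in the quant log is the one reporting processed and pseudoaligned reads: the earlier failing run reported 0 reads for both, while this run reports non-zero counts. A minimal sketch for pulling those two numbers out of a saved log, assuming the line format shown in this thread (the function name `pseudoalignment_rate` is hypothetical):

```python
import re

# Pattern based on the "[quant] processed ... reads" line shown in this thread.
# Thousands separators are tolerated in case a log prints them.
_PROCESSED = re.compile(
    r"\[quant\] processed ([\d,]+) reads, ([\d,]+) reads pseudoaligned"
)


def pseudoalignment_rate(log_text):
    """Return (processed, pseudoaligned, rate) parsed from a quant log.

    rate is None when no reads were processed, since the ratio is undefined.
    """
    match = _PROCESSED.search(log_text)
    if match is None:
        raise ValueError("no '[quant] processed' line found in log")
    processed = int(match.group(1).replace(",", ""))
    aligned = int(match.group(2).replace(",", ""))
    rate = aligned / processed if processed else None
    return processed, aligned, rate
```

A result of `(0, 0, None)` flags a run like the failing one earlier in the thread, where the inputs should be checked before anything downstream.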

4 quant output

head test_kallisto/abundance.txt
target_id   length  eff_length  est_counts  tpm
ENSRNA049756373-T1_L    91  91  0.5 0.939
ENSRNA049756376-T1_L    86  86  1.5 2.98078
ENSRNA049756377-T1_L    101 101 0   0
ENSRNA049756378-T1_L    119 119 0   0
ENSRNA049756379-T1_L    141 141 0   0
ENSRNA049756380-T1_L    92  92  0   0
ENSRNA049756381-T1_L    103 103 0   0
ENSRNA049756382-T1_L    164 8.07353 0   0
ENSRNA049756383-T1_L    155 155 23  25.3591

5 Trying kallisto-align

~/app/kallisto-align/kallisto-align -i test/emase.pooled.transcripts.k_idx -f ../../fastq/SRR5125122_1.fastq.gz ../../fastq/SRR5125122_2.fastq.gz -b my_sample.bin
[kallisto-align] Creating my_sample.bin...
Error: incompatible indices. Found version 9, expected version 0
Rerun with index to regenerate%

6 Try building the index with the kallisto distributed with kallisto-align

/home/ipedroso/app/kallisto-align/external/src/kallisto-build/src/kallisto index -i emase.pooled.transcripts.kOld_index emase.pooled.transcripts.fa

[build] loading fasta file emase.pooled.transcripts.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 472 target sequences
[build] warning: replaced 78 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 44839 contigs and contains 20829208 k-mers

$ /home/ipedroso/app/kallisto-align/external/src/kallisto-build/src/kallisto quant -i emase.pooled.transcripts.kOld_index -o k_old ../../../fastq/SRR5125122_1.fastq.gz ../../../fastq/SRR5125122_2.fastq.gz

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 29496
[index] number of k-mers: 20829208
[index] number of equivalence classes: 56014
[quant] running in paired-end mode
[quant] will process pair 1: ../../../fastq/SRR5125122_1.fastq.gz
                             ../../../fastq/SRR5125122_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 23476745 reads, 12434421 reads pseudoaligned
[quant] estimated average fragment length: 156.926
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1088 rounds

$ head k_old/abundance.txt
target_id   length  eff_length  est_counts  tpm
ENSRNA049756373-T1_L    91  91  0.5 0.939
ENSRNA049756376-T1_L    86  86  1.5 2.98078
ENSRNA049756377-T1_L    101 101 0   0
ENSRNA049756378-T1_L    119 119 0   0
ENSRNA049756379-T1_L    141 141 0   0
ENSRNA049756380-T1_L    92  92  0   0
ENSRNA049756381-T1_L    103 103 0   0
ENSRNA049756382-T1_L    164 8.07353 0   0
ENSRNA049756383-T1_L    155 155 23  25.3591
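
The first rows of this abundance.txt match those from step 4, which suggests the two kallisto binaries behave the same. To compare the full files rather than only the head, a minimal sketch follows; it assumes a whitespace-separated table with a header row whose first column is the target ID, as shown above. The function names `parse_abundance` and `differing_targets` are hypothetical.

```python
def parse_abundance(text):
    """Parse an abundance table into {target_id: {column: float}}."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    rows = {}
    for line in lines[1:]:
        fields = line.split()
        # Pair each numeric column name with its value for this target.
        rows[fields[0]] = dict(zip(header[1:], map(float, fields[1:])))
    return rows


def differing_targets(first, second, tolerance=1e-6):
    """Return sorted target IDs that are missing from one table or whose
    values differ by more than the tolerance in any column."""
    different = []
    for target in sorted(set(first) | set(second)):
        a, b = first.get(target), second.get(target)
        if a is None or b is None or a.keys() != b.keys():
            different.append(target)
        elif any(abs(a[key] - b[key]) > tolerance for key in a):
            different.append(target)
    return different
```

With both files read into strings, `differing_targets(parse_abundance(a), parse_abundance(b))` returning an empty list would mean the two runs agree on every target.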

7 kallisto-align with the new index

~/app/kallisto-align/kallisto-align -i test/emase.pooled.transcripts.kOld_index -f ../../fastq/SRR5125122_1.fastq.gz ../../fastq/SRR5125122_2.fastq.gz -b my_sample.bin
[kallisto-align] Creating my_sample.bin...
Error: incompatible indices. Found version 9, expected version 0
Rerun with index to regenerate%

I apologise again for any confusion or mistakes I made previously. kallisto itself is working fine, as expected, and as I mentioned I had used it before.

Both the kallisto you distribute with kallisto-align and the one I downloaded are v0.42.1

I am happy to send along or upload somewhere the transcriptome and fastq files if that helps to work out what is going on ...

Thanks again for your help on this!

kbchoi-jax commented 5 years ago

That error message comes from kallisto and literally says that your index does not match the expected version for some reason. Try building the kallisto index using /home/ipedroso/app/kallisto-align/external/src/kallisto-build/src/kallisto index.

kbchoi-jax commented 5 years ago

Also, the following does not look right, because you are providing the same FASTA file for both L and R. Usually you should provide L.fa and R.fa. Is SRR5125117.fa the diploid genome you created with g2gtools?

$ prepare-emase -G SRR5125117.fa,SRR5125117.fa -g SRR5125117.L.gtf,SRR5125117.R.gtf -s L,R -o test -m -x

If SRR5125117.fa is diploid, I think you should be able to simply do the following.

$ prepare-emase -G SRR5125117.fa -g SRR5125117.gtf -o test -m -x

inti commented 5 years ago

Hi, in the example above I did build the index with /home/ipedroso/app/kallisto-align/external/src/kallisto-build/src/kallisto index; see step 6 in my message above.

Regarding prepare-emase: yes, SRR5125117.fa is the diploid genome generated by g2gtools. I was just trying to replicate the emase protocol, which has separate files for each haplotype.

Here is the test. It does not seem to make a difference

$ prepare-emase -G SRR5125117.fa -g SRR5125117.gtf -o test2 -m -x

$ /home/ipedroso/app/kallisto-align/external/src/kallisto-build/src/kallisto index -i test2/emase.transcripts.k_idx test2/emase.transcripts.fa

[build] loading fasta file test2/emase.transcripts.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 472 target sequences
[build] warning: replaced 78 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 44835 contigs and contains 20829133 k-mers

$ ~/app/kallisto-align/kallisto-align -i test2/emase.transcripts.k_idx -f ../../fastq/SRR5125122_1.fastq.gz ../../fastq/SRR5125122_2.fastq.gz -b my_sample.bin
[kallisto-align] Creating my_sample.bin...
Error: incompatible indices. Found version 9, expected version 0
Rerun with index to regenerate%

Previously you said that kallisto currently uses index version 9 and kallisto-align uses version 8. However, the message says it expects version 0 of the index. Is that correct? If I send you the transcriptome index, would you try to replicate the error?

Many thanks again

inti commented 5 years ago

Quick question: what does kallisto-align actually do? If I run kallisto to generate a pseudobam file and convert it into the emase binary format with alntools, would that replace kallisto-align?

kbchoi-jax commented 5 years ago

You are right, you can convert kallisto pseudobam into emase binary file and run emase-zero. But kallisto-align does it way faster. The kallisto that we carry should not create Version 9 index.

inti commented 5 years ago

Let me know if there is anything I can do to help debug this. In the meantime I will take the longer route to test the g2gtools + emase-zero pipeline.

Thanks a lot again and sorry for the initial confusion

inti commented 5 years ago

Hi, any updates on this issue? I would love to use kallisto-align.

Regarding:

You are right, you can convert kallisto pseudobam into emase binary file and run emase-zero. But kallisto-align does it way faster. The kallisto that we carry should not create Version 9 index.

Would the equivalent steps be: kallisto [fastq -> pseudobam] => alntools [pseudobam -> bin-emase] => emase-zero [awesome results]?

Do you do local alignment of the reads to the transcripts? As I understand it, the pseudobam does not really align the read; it assigns the read to a transcript and makes up a CIGAR string. Perhaps the real question is whether emase-zero needs alignments or can work with read-to-transcript assignments.

Thanks in advance
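
To make the distinction in the question concrete: read-to-transcript assignments alone are enough to group reads into equivalence classes, meaning sets of transcripts that a read is compatible with, together with a count per set. The following is a conceptual sketch only; it is not the binary format written by kallisto-align or alntools, and the function name `equivalence_class_counts` is hypothetical.

```python
from collections import Counter


def equivalence_class_counts(assignments):
    """Group reads by the set of transcripts they are assigned to.

    assignments: iterable of (read_id, transcript_id) pairs, for example
    one pair per record in a pseudobam file.
    Returns a Counter mapping frozenset(transcript_ids) -> number of reads.
    """
    per_read = {}
    for read_id, transcript_id in assignments:
        per_read.setdefault(read_id, set()).add(transcript_id)
    return Counter(frozenset(targets) for targets in per_read.values())
```

In this picture no base-level alignment or CIGAR string is used; only which transcripts each read is compatible with matters.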