Teichlab / tracer

TraCeR - reconstruction of T cell receptor sequences from single-cell RNAseq data
Other
122 stars 48 forks source link

issue with build #78

Closed mhassan closed 5 years ago

mhassan commented 5 years ago

I have installed and successfully run the test. I am now trying to use 'tracer build' to create a new resource for my organism of choice (Btau). However, I keep getting the error I keep getting the error "assert os.path.isdir(resources_root), "Species not found in resources". I didn't expect it to find the spp since that's what am trying to create. Nevertheless I made a folder and move all the V, J, C, and D sequences in it, but still get this error. Do you have any idea how to go about it.

mstubb commented 5 years ago

Hi @mhassan,

Sorry about that. There was a bug (introduced by another change) that meant we always required a species directory even when using build. Please can you try the latest version from the master branch and let me know if it works for you now?

All the best,

Mike

mhassan commented 5 years ago

Hi Mike, Thanks, that resolved the issue. However, am now having a problem with igblastn. Initially there was an error with the species: igblast_species = species_mapper[species] KeyError: 'Btau

So I edited the species_wrapper in tracer_func.py to include "Btau": 'Bos_taurus'

Which clears the error but then igblastn produces this error:

CMD failed: /exports/mhassan/igBLAST/ncbi-igblast-1.9.0/bin/igblastn -germline_db_V ./bos/Btau/igblast_dbs/TCR_V.fa -germline_db_D ./bos/Btau/igblast_dbs/TCR_D.fa -germline_db_J ./bos/Btau/igblast_dbs/TCR_J.fa -domain_system imgt -organism Bos_taurus -ig_seqtype TCR -show_translation -num_alignments_V 5 -num_alignments_D 5 -num_alignments_J 5 -outfmt 3 -auxiliary_data optional_file/Bos_taurus_gl.aux -query /exports/mhassan/Tim/TCRs/A7/Trinity_output/A7_TCR_A.Trinity.fasta

USAGE igblastn [-h] [-help] [-germline_db_V germline_database_name] [-num_alignments_V int_value] [-germline_db_V_seqidlist filename] [-germline_db_D germline_database_name] [-num_alignments_D int_value] [-germline_db_D_seqidlist filename] [-germline_db_J germline_database_name] [-num_alignments_J int_value] [-germline_db_J_seqidlist filename] [-auxiliary_data filename] [-min_D_match min_D_match] [-D_penalty D_penalty] [-J_penalty J_penalty] [-num_clonotype num_clonotype] [-clonotype_out clonotype_out] [-allow_vdj_overlap] [-organism germline_origin] [-domain_system domain_system] [-ig_seqtype sequence_type] [-focus_on_V_segment] [-extend_align5end] [-min_V_length Min_V_Length] [-min_J_length Min_J_Length] [-show_translation] [-db database_name] [-dbsize num_letters] [-gilist filename] [-seqidlist filename] [-negative_gilist filename] [-negative_seqidlist filename] [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename] [-negative_taxidlist filename] [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm] [-subject subject_input_file] [-subject_loc range] [-query input_file] [-sra accession] [-out output_file] [-evalue evalue] [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty] [-qcov_hsp_perc float_value] [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value] [-xdrop_gap_final float_value] [-searchsp int_value] [-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy] [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value] [-best_hit_score_edge float_value] [-window_size int_value] [-off_diagonal_range int_value] [-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines] [-outfmt format] [-show_gis] [-num_descriptions int_value] [-num_alignments int_value] [-line_length line_length] [-max_target_seqs num_sequences] [-num_threads int_value] [-remote] [-version]

DESCRIPTION Nucleotide-Nucleotide BLAST for immunoglobulin sequences 2.8.0+

Use '-help' to print detailed descriptions of command line arguments

Error: Argument "organism". Illegal value, expected human',mouse', rabbit',rat', rhesus_monkey':Bos_taurus' Error: (CArgException::eConstraint) Argument "organism". Illegal value, expected human',mouse', rabbit',rat', rhesus_monkey':Bos_taurus'

I've tried the default --seq_method and 'assembly' but none seems to work.

Musa

mstubb commented 5 years ago

Ah. Sorry about this. I’ll be happy to help make this work. It’d be really helpful if you could send me the following things (either through here or directly through my email mjt.stubbington [at] gmail.com).

  1. The command that you used to build your Btau reference

  2. The command you’re running that gives the igblastn error below

  3. The reference sequences that you’re using as input to build

  4. FASTQ files for one of your cells so that I can test them as input once everything else is working

I think we made some breaking changes to build when updating the other functions and it wasn’t tested fully. Sorry you have to be the test-case but this is really useful and I should be able to get a working version to you soon.

All the best,

Mike

On 20 Sep 2018, at 23:39, mhassan notifications@github.com wrote:

Hi Mike, Thanks, that resolved the issue. However, am now having a problem with igblastn. Initially there was an error with the species: igblast_species = species_mapper[species] KeyError: 'Btau

So I edited the species_wrapper in tracer_func.py to include "Btau": 'Bos_taurus'

Which clears the error but then igblastn produces this error:

CMD failed: /exports/mhassan/igBLAST/ncbi-igblast-1.9.0/bin/igblastn -germline_db_V ./bos/Btau/igblast_dbs/TCR_V.fa -germline_db_D ./bos/Btau/igblast_dbs/TCR_D.fa -germline_db_J ./bos/Btau/igblast_dbs/TCR_J.fa -domain_system imgt -organism Bos_taurus -ig_seqtype TCR -show_translation -num_alignments_V 5 -num_alignments_D 5 -num_alignments_J 5 -outfmt 3 -auxiliary_data optional_file/Bos_taurus_gl.aux -query /exports/mhassan/Tim/TCRs/A7/Trinity_output/A7_TCR_A.Trinity.fasta

USAGE igblastn [-h] [-help] [-germline_db_V germline_database_name] [-num_alignments_V int_value] [-germline_db_V_seqidlist filename] [-germline_db_D germline_database_name] [-num_alignments_D int_value] [-germline_db_D_seqidlist filename] [-germline_db_J germline_database_name] [-num_alignments_J int_value] [-germline_db_J_seqidlist filename] [-auxiliary_data filename] [-min_D_match min_D_match] [-D_penalty D_penalty] [-J_penalty J_penalty] [-num_clonotype num_clonotype] [-clonotype_out clonotype_out] [-allow_vdj_overlap] [-organism germline_origin] [-domain_system domain_system] [-ig_seqtype sequence_type] [-focus_on_V_segment] [-extend_align5end] [-min_V_length Min_V_Length] [-min_J_length Min_J_Length] [-show_translation] [-db database_name] [-dbsize num_letters] [-gilist filename] [-seqidlist filename] [-negative_gilist filename] [-negative_seqidlist filename] [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename] [-negative_taxidlist filename] [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm] [-subject subject_input_file] [-subject_loc range] [-query input_file] [-sra accession] [-out output_file] [-evalue evalue] [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty] [-qcov_hsp_perc float_value] [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value] [-xdrop_gap_final float_value] [-searchsp int_value] [-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy] [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value] [-best_hit_score_edge float_value] [-window_size int_value] [-off_diagonal_range int_value] [-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines] [-outfmt format] [-show_gis] [-num_descriptions int_value] [-num_alignments int_value] [-line_length line_length] [-max_target_seqs num_sequences] [-num_threads int_value] [-remote] [-version]

DESCRIPTION Nucleotide-Nucleotide BLAST for immunoglobulin sequences 2.8.0+

Use '-help' to print detailed descriptions of command line arguments

Error: Argument "organism". Illegal value, expected human',mouse', rabbit',rat', rhesus_monkey':Bos_taurus' Error: (CArgException::eConstraint) Argument "organism". Illegal value, expected human',mouse', rabbit',rat', rhesus_monkey':Bos_taurus'

I've tried the default --seq_method and 'assembly' but none seems to work.

Musa

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Teichlab/tracer/issues/78#issuecomment-423357112, or mute the thread https://github.com/notifications/unsubscribe-auth/ABFwhr8uAvLoZVIYSz-zp4aZSG1oSC98ks5udBkggaJpZM4WxE3O.

mstubb commented 5 years ago

Hi @mhassan,

I think this is fixed in the build_iss78 branch (https://github.com/Teichlab/tracer/tree/build_iss78). Please can you download this and see if it works for you? You shouldn't need to edit 'tracer_func.py' now.

A few notes:

  1. This was breaking because we added functionality to use an 'auxiliary file' from IgBLAST to detect CDR3s based on annotations of CDR locations. These files are available with IgBLAST but obviously don't exist for custom species. So, Tracer now won't attempt to do this for custom reference builds.

  2. Please could you manually check the first few assembled sequences to see if Tracer is reporting their 'productive' status correctly. Does Tracer call them 'productive' if they are in-frame/lack stop codons and vice versa? If not, our method for doing this may be too mouse/human-specific and will need revising.

  3. You'll need to supply a Bos taurus reference transcriptome and to add its path into your config file. I made one using the instructions at http://www.nxn.se/valent/2016/10/3/the-first-steps-in-rna-seq-expression-analysis-single-cell-and-other and you can download it from http://www.ensembl.org/biomart/martresults/248?file=martquery_0926102305_185.txt.gz. I know nothing about cow genomics though so this may not be optimal!

  4. In building Tracer references from your cow fasta files, I noticed the following:

Hope that's helpful. Please let me know how you get on and if you need anything else.

All the best,

Mike

mhassan commented 5 years ago

Thanks Mike for the tweaks. They seem to have resolved the initial issues but another issue came up, which seems to be an open issue raised by another user #65. "BLAST query/options error: Germline annotation database human/human_TR_V could not be found in [internal_data] directory Please refer to the BLAST+ user manual"

My initial thought was that the igblastn is looking in the wrong resource directory i.e. not the one in tracer.conf file, but that doesn't appear to be the case. Musa

mstubb commented 5 years ago

Well, I'm glad that the first part seems to be fixed.

Have you installed IgBLAST according to the instructions here by downloading the internal_data directory and setting your $IGDATA environment variable?

Does tracer test work or just crash in the same place?

mhassan commented 5 years ago

The igBLAST path was properly set, however, 'internal_data' and 'optional_file' needed to be in the bin sub-directory. Am wondering though why tracer needs to build salmon index even when a pre-made index is given. This seems to cause a fatal error in my case in which tracer stops with: "salmon quant was invoked improperly" error message

mstubb commented 5 years ago

Tracer builds a unique salmon index for each cell that includes the specific reconstructed TCR sequences. It does this by appending the reconstructed TCR sequences to the end of the reference transcriptome fasta file that’s specified in the config and then building the index. This enables it to quantify expression of the reconstructed sequences - this is then used for filtering in cases where more than two sequences are reconstructed for a locus.

I’ve not seen that salmon error before. Happy to help with it though. Please send me the full error output from tracer.

Thanks!

On 30 Sep 2018, at 10:42, mhassan notifications@github.com wrote:

The igBLAST path was properly set, however, 'internal_data' and 'optional_file' needed to be in the bin sub-directory. Am wondering though why tracer needs to build salmon index even when a pre-made index is given. This seems to cause a fatal error in my case in which tracer stops with: "salmon quant was invoked improperly" error message

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Teichlab/tracer/issues/78#issuecomment-425737888, or mute the thread https://github.com/notifications/unsubscribe-auth/ABFwhm0x95_Oeo2mj3jpDWla-dZYiwLNks5ugQKMgaJpZM4WxE3O.

mhassan commented 5 years ago

here is the tracer run including the error message at the end.

Finding recombinant-derived reads

Resuming with existing TCR_A reads Resuming with existing TCR_B reads

Assembling Trinity Contigs

Resuming with existing Trinity output

Running IgBLAST

Running IgBLAST

Performing IgBlast on ['TCR_A', 'TCR_B']

TCR_A

TCR_B

No library type specified for salmon in the configuration file. Using automatic detection (--libType A). No kmer length specified for salmon in the configuration file. Using default value of 31.

Running Salmon

Making Salmon indices

Version Info: ### A newer version of Salmon is available. ####

The newest version, available at https://github.com/COMBINE-lab/salmon/releases contains important bug fixes and improvements; please upgrade at your earliest convenience.

index ["/mhassan2/Tim/TCRs/A7/expression_quantification/salmon_index/A7_transcriptome.idx"] did not previously exist . . . creating it [2018-09-30 18:33:30.945] [jLog] [info] building index RapMap Indexer

[Step 1 of 4] : counting k-mers counted k-mers for 20000 transcriptsElapsed time: 0.515421s

Replaced 544 non-ATCG nucleotides Clipped poly-A tails from 16 transcripts Building rank-select dictionary and saving to disk done Elapsed time: 0.00164497s Writing sequence data to file . . . done Elapsed time: 0.0960587s [info] Building 32-bit suffix array (length of generalized text is 46217282) Building suffix array . . . success saving to disk . . . done Elapsed time: 0.143047s done Elapsed time: 5.08416s processed 19000000 positionstcmalloc: large alloc 1073741824 bytes == 0x57032000 @ processed 38000000 positionstcmalloc: large alloc 2147483648 bytes == 0x97032000 @ processed 46000000 positions khash had 40737272 keys saving hash to disk . . . done Elapsed time: 3.87736s [2018-09-30 18:34:09.634] [jLog] [info] done building index

Quantifying with Salmon

Version Info: ### A newer version of Salmon is available. ####

The newest version, available at https://github.com/COMBINE-lab/salmon/releases contains important bug fixes and improvements; please upgrade at your earliest convenience.

salmon (mapping-based) v0.6.0

[ program ] => salmon

[ command ] => quant

[ index ] => { /mhassan2/Tim/TCRs/A7/expression_quantification/salmon_index/A7_transcriptome.idx }

[ libType ] => { A }

[ threads ] => { 8 }

[ mates1 ] => { Tim/CowSeq_2053_G9_11d_PC_cDNA_A7_PRO1834_S1_libDNA_TAGCGCTC-ATAGAGAG_L006_R1_val_1.fq.gz }

[ mates2 ] => { Tim/CowSeq_2053_G9_11d_PC_cDNA_A7_PRO1834_S1_libDNA_TAGCGCTC-ATAGAGAG_L006_R2_val_2.fq.gz }

[ output ] => { /mhassan2/Tim/TCRs/A7/expression_quantification }

Logs will be written to /mhassan2/Tim/TCRs/A7/expression_quantification/logs Exception : [unknown library format string : A] salmon/v0.6.1-pre/bin/salmon quant was invoked improperly. For usage information, try salmon/v0.6.1-pre/bin/salmon quant --help Exiting. [2018-09-30 18:34:09.807] [jointLog] [info] parsing read library format Traceback (most recent call last): File "tracer/tracer", line 21, in launch() File "/mhassan2/tracer/tracerlib/launcher.py", line 43, in launch Task().run() File "/mhassan2/tracer/tracerlib/tasks.py", line 373, in run self.quantify(cell) File "/mhassan2/tracer/tracerlib/tasks.py", line 618, in quantify self.fragment_sd, salmon_libType, salmon_kmerLen) File "/mhassan2/tracer/tracerlib/tracer_func.py", line 1472, in quantify_with_salmon subprocess.check_call(salmon_command) File "python/2.7.10/lib/python2.7/subprocess.py", line 540, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['salmon/v0.6.1-pre/bin/salmon', 'quant', '-i', '/mhassan2/Tim/TCRs/A7/expression_quantification/salmon_index/A7_transcriptome.idx', '-l', 'A', '-p', '8', '-1', 'Tim/CowSeq_2053_G9_11d_PC_cDNA_A7_PRO1834_S1_libDNA_TAGCGCTC-ATAGAGAG_L006_R1_val_1.fq.gz', '-2', 'Tim/CowSeq_2053_G9_11d_PC_cDNA_A7_PRO1834_S1_libDNA_TAGCGCTC-ATAGAGAG_L006_R2_val_2.fq.gz', '-o', '/mhassan2/Tim/TCRs/A7/expression_quantification']' returned non-zero exit status 1

mstubb commented 5 years ago

Thanks. It looks like you’re running a really old version of Salmon (v0.6.1). Can you upgrade to a newer version?

M

On 30 Sep 2018, at 11:20, mhassan notifications@github.com wrote:

here is the tracer run including the error message at the end.

Finding recombinant-derived reads

Resuming with existing TCR_A reads Resuming with existing TCR_B reads

Assembling Trinity Contigs

Resuming with existing Trinity output

Running IgBLAST

Running IgBLAST

Performing IgBlast on ['TCR_A', 'TCR_B']

TCR_A

TCR_B

No library type specified for salmon in the configuration file. Using automatic detection (--libType A). No kmer length specified for salmon in the configuration file. Using default value of 31.

Running Salmon

Making Salmon indices

Version Info: ### A newer version of Salmon is available. ####

The newest version, available at https://github.com/COMBINE-lab/salmon/releases contains important bug fixes and improvements; please upgrade at your earliest convenience.

index ["/mhassan2/Tim/TCRs/A7/expression_quantification/salmon_index/A7_transcriptome.idx"] did not previously exist . . . creating it [2018-09-30 18:33:30.945] [jLog] [info] building index RapMap Indexer

[Step 1 of 4] : counting k-mers counted k-mers for 20000 transcriptsElapsed time: 0.515421s

Replaced 544 non-ATCG nucleotides Clipped poly-A tails from 16 transcripts Building rank-select dictionary and saving to disk done Elapsed time: 0.00164497s Writing sequence data to file . . . done Elapsed time: 0.0960587s [info] Building 32-bit suffix array (length of generalized text is 46217282) Building suffix array . . . success saving to disk . . . done Elapsed time: 0.143047s done Elapsed time: 5.08416s processed 19000000 positionstcmalloc: large alloc 1073741824 bytes == 0x57032000 @ processed 38000000 positionstcmalloc: large alloc 2147483648 bytes == 0x97032000 @ processed 46000000 positions khash had 40737272 keys saving hash to disk . . . done Elapsed time: 3.87736s [2018-09-30 18:34:09.634] [jLog] [info] done building index

Quantifying with Salmon

Version Info: ### A newer version of Salmon is available. ####

The newest version, available at https://github.com/COMBINE-lab/salmon/releases contains important bug fixes and improvements; please upgrade at your earliest convenience.

salmon (mapping-based) v0.6.0

[ program ] => salmon

[ command ] => quant

[ index ] => { /mhassan2/Tim/TCRs/A7/expression_quantification/salmon_index/A7_transcriptome.idx }

[ libType ] => { A }

[ threads ] => { 8 }

[ mates1 ] => { Tim/CowSeq_2053_G9_11d_PC_cDNA_A7_PRO1834_S1_libDNA_TAGCGCTC-ATAGAGAG_L006_R1_val_1.fq.gz }

[ mates2 ] => { Tim/CowSeq_2053_G9_11d_PC_cDNA_A7_PRO1834_S1_libDNA_TAGCGCTC-ATAGAGAG_L006_R2_val_2.fq.gz }

[ output ] => { /mhassan2/Tim/TCRs/A7/expression_quantification }

Logs will be written to /mhassan2/Tim/TCRs/A7/expression_quantification/logs Exception : [unknown library format string : A] salmon/v0.6.1-pre/bin/salmon quant was invoked improperly. For usage information, try salmon/v0.6.1-pre/bin/salmon quant --help Exiting. [2018-09-30 18:34:09.807] [jointLog] [info] parsing read library format Traceback (most recent call last): File "tracer/tracer", line 21, in launch() File "/mhassan2/tracer/tracerlib/launcher.py", line 43, in launch Task().run() File "/mhassan2/tracer/tracerlib/tasks.py", line 373, in run self.quantify(cell) File "/mhassan2/tracer/tracerlib/tasks.py", line 618, in quantify self.fragment_sd, salmon_libType, salmon_kmerLen) File "/mhassan2/tracer/tracerlib/tracer_func.py", line 1472, in quantify_with_salmon subprocess.check_call(salmon_command) File "python/2.7.10/lib/python2.7/subprocess.py", line 540, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['salmon/v0.6.1-pre/bin/salmon', 'quant', '-i', '/mhassan2/Tim/TCRs/A7/expression_quantification/salmon_index/A7_transcriptome.idx', '-l', 'A', '-p', '8', '-1', 'Tim/CowSeq_2053_G9_11d_PC_cDNA_A7_PRO1834_S1_libDNA_TAGCGCTC-ATAGAGAG_L006_R1_val_1.fq.gz', '-2', 'Tim/CowSeq_2053_G9_11d_PC_cDNA_A7_PRO1834_S1_libDNA_TAGCGCTC-ATAGAGAG_L006_R2_val_2.fq.gz', '-o', '/mhassan2/Tim/TCRs/A7/expression_quantification']' returned non-zero exit status 1

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

mhassan commented 5 years ago

Probably was an issue with the salmon version. Though haven't tested it yet, reverting to kallisto (for which I have the latest version) worked without any errors or warnings. Thanks.

mstubb commented 5 years ago

Great. I’ll merge the branch that fixed the build errors into master and then I’ll close this issue.

M

On 30 Sep 2018, at 12:00, mhassan notifications@github.com wrote:

Probably was an issue with the salmon version. Though haven't tested it yet, reverting to kallisto (for which I have the latest version) worked without any errors or warnings. Thanks.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

mhassan commented 5 years ago

Thanks for the help Mike.

mstubb commented 5 years ago

You're welcome. Hope you find out some cool things to do with cow TCRs!