TreesLab / NCLscan

We have developed a new pipeline, NCLscan, which is rather advantageous in the identification of both intragenic and intergenic "non-co-linear" (NCL) transcripts (fusion, trans-splicing, and circular RNA) from paired-end RNA-seq data.
MIT License

Not reading files from the directory #12

Closed archu87 closed 5 years ago

archu87 commented 6 years ago

Hi, I installed NCLScan-1.6 with all desired dependencies.

All files are in the folder NCLScan-1.6, except bwa, bedtools, blat, and samtools, which are installed in /usr/bin/.

I specified the paths, but I get these errors:

    : not found
    cat: '/home/archana87/NCLscan-1.6/GRCh37.p13.genome.fa'$'\r': No such file or directory
    cat: '/home/archana87/NCLscan-1.6/GRCh37.p13.genome.fa'$'\r': No such file or directory
    cat: '/home/archana87/NCLscan-1.6/gencode.v19.pc_transcripts.fa'$'\r': No such file or directory
    cat: '/home/archana87/NCLscan-1.6/gencode.v19.lncRNA_transcripts.fa'$'\r': No such file or directory
    : not found/bin/bwa
    : not founde/archana87/NCLscan-1.6/novocraft/novoindex

Note: It recognizes /usr/bin/samtools, /usr/bin/bedtools, etc., but not /usr/bin/bwa. The same happens with novoalign.

Could you please suggest something? Thanks.

chiangtw commented 6 years ago

Hi, I think you should check the paths in your "NCLscan.config" file. Are the file and tool paths all correct?
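One way to run that check is sketched below. This is only an illustration, not part of NCLscan: the helper names `parse_config` and `find_bad_paths` are hypothetical, the sketch assumes Python 3, and it assumes the config consists of `key = value` lines with `#` comments. Values are deliberately not stripped of `\r`, so a hidden carriage return shows up in the reported path.

```python
import os


def parse_config(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments.

    Only spaces and tabs are stripped from values, so a stray carriage
    return left by Windows line endings stays visible to the caller.
    """
    options = {}
    for raw in text.split("\n"):
        line = raw.lstrip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip(" \t")
    return options


def find_bad_paths(options):
    """Return (key, value) pairs whose value is an absolute path that does not exist."""
    bad = []
    for key, value in options.items():
        # Values such as {NCLscan_bin}/GetKey are templates, not paths, so skip them.
        if value.startswith("/") and not os.path.exists(value):
            bad.append((key, value))
    return bad
```

To use it, read the file with `open("NCLscan.config", newline="").read()` so that Python does not translate line endings, pass the text through `parse_config` and `find_bad_paths`, and print each value with `repr()`; a path ending in `\r` then stands out immediately.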

tw

archu87 commented 6 years ago

Thanks for your reply. Please see the details of the NCLscan.config file that I am using. Any help is much appreciated.

#############################

#############################

The directory of NCLscan

NCLscan_dir =

The directory of references and indices

The script "create_reference.py" would create the needed references and indices here.

NCLscan_ref_dir =/home/archana87/NCLscan-1.6

The following four reference files can be downloaded from the GENCODE website (http://www.gencodegenes.org/).

The reference genome sequence, eg. /path/to/GRCh37.p13.genome.fa

Reference_genome = /home/archana87/NCLscan-1.6/GRCh37.p13.genome.fa

The gene annotation file, eg. /path/to/gencode.v19.annotation.gtf

Gene_annotation = /home/archana87/NCLscan-1.6/gencode.v19.annotation.gtf

The protein-coding transcript sequences, eg. /path/to/gencode.v19.pc_transcripts.fa

Protein_coding_transcripts = /home/archana87/NCLscan-1.6/gencode.v19.pc_transcripts.fa

The long non-coding RNA transcript sequences, eg. /path/to/gencode.v19.lncRNA_transcripts.fa

lncRNA_transcripts = /home/archana87/NCLscan-1.6/gencode.v19.lncRNA_transcripts.fa

External tools

bedtools_bin = /usr/bin/bedtools

blat_bin = /home/archana87/blat

bwa_bin = /usr/bin/bwa

samtools_bin = /usr/bin/samtools

novoalign_bin = /home/archana87/NCLscan-1.6/novocraft/novoalign

novoindex_bin = /home/archana87/NCLscan-1.6/novocraft/novoindex

Bin

NCLscan_bin = {NCLscan_dir}/bin

Add_read_count_bin = {NCLscan_bin}/Add_read_count.py

AssembleExons_bin = {NCLscan_bin}/AssembleExons

AssembleFastq_bin = {NCLscan_bin}/AssembleFastq

AssembleJSeq_bin = {NCLscan_bin}/AssembleJSeq.py

FastqOut_bin = {NCLscan_bin}/FastqOut

get_gene_name_bin = {NCLscan_bin}/get_gene_name.py

GetInfo_bin = {NCLscan_bin}/GetInfo

GetKey_bin = {NCLscan_bin}/GetKey

GetNameB4Dot_bin = {NCLscan_bin}/GetNameB4Dot

InsertInList_bin = {NCLscan_bin}/InsertInList

JSFilter_bin = {NCLscan_bin}/JSFilter

JSParser_bin = {NCLscan_bin}/JSParser

JunctionSite2BED_bin = {NCLscan_bin}/JunctionSite2BED

mp_blat_bin = {NCLscan_bin}/mp_blat.py

PslChimeraFilter_bin = {NCLscan_bin}/PslChimeraFilter

RemoveInList_bin = {NCLscan_bin}/RemoveInList

RetainInList_bin = {NCLscan_bin}/RetainInList

RmBadMapping_bin = {NCLscan_bin}/RmBadMapping

RmColinearPairInSam_bin = {NCLscan_bin}/RmColinearPairInSam

RmRedundance_bin = {NCLscan_bin}/RmRedundance

SeqOut_bin = {NCLscan_bin}/SeqOut

###########################

Advanced parameters

###########################

The following two parameters indicate the maximal read length (L) and fragment size of the used paired-end RNA-seq data (FASTQ files), where fragment size = 2L + insert size.

If L > 151, the users should change these two parameters to (L, 2L + insert size).

max_read_len = 151

max_fragment_size = 500

The base quality threshold. The value should be a non-negative integer.

quality_score = 20

The collection of the supporting reads must span the NCL junction boundary by the setting size of span range on both sides of the junction site.

span_range = 50

###################

Performance

###################

Parameters for bwa mem

The number of threads

bwa-mem-t = 1

Parameters for mp_blat.py

The number of processes for running blat

NOTE: The memory usage of each blat process would be up to 4 GB!

mp_blat_process = 1

chiangtw commented 6 years ago

Hi, I noticed one thing in your first post: there are $'\r' characters (carriage returns from Windows-style line endings) in your paths. I think that might be the cause of this problem! Perhaps you copied and pasted the paths on a Windows system?

You can try the following command to fix it:

> cat NCLscan.config | sed -r 's/\r//g' > new_NCLscan.config
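If sed is not at hand, the same cleanup can be sketched in Python. This is a minimal illustration, assuming Python 3; the function names `strip_carriage_returns` and `convert_file` are made up for this example and are not part of NCLscan.

```python
def strip_carriage_returns(data):
    """Remove every carriage-return byte from data, leaving line feeds intact."""
    return data.replace(b"\r", b"")


def convert_file(src_path, dst_path):
    """Copy src_path to dst_path with all carriage returns removed."""
    # Binary mode keeps Python from translating line endings on its own.
    with open(src_path, "rb") as src:
        cleaned = strip_carriage_returns(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
```

Calling `convert_file("NCLscan.config", "new_NCLscan.config")` writes a cleaned copy under the same file name as the sed command above.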

Thanks!

tw

archu87 commented 6 years ago

Thanks a lot for your help. It solved my problem, but now I am getting the following error after running create_reference.py:

    ./bin/create_reference.py -c new_NCLscan.config
    ........................
    Error: Duplicate FASTA sequence chrM in file /home/archana87/NCLscan-1.6/AllRef.fa.

When I counted it in the AllRef.fa files, I got this result:

    AllRef.fa:2
    AllRef.fa.amb:0
    AllRef.fa.ann:2
    AllRef.fa.bwt:0
    AllRef.fa.pac:0
    AllRef.fa.sa:0
    AllRef.ndx:0

Also, when I cross-checked the file sizes, I got this result:

    3.3G AllRef.fa
    16K  AllRef.fa.amb
    18M  AllRef.fa.ann
    3.3G AllRef.fa.bwt
    1.7G AllRef.fa.pac
    2.2G AllRef.fa.sa
    0    AllRef.ndx

When I run the sample files, I do not get any result. Could you please suggest something?

Any help is much appreciated. Thanks

chiangtw commented 6 years ago

Hi, Thanks for reporting!

This error is reported by novoindex because of the duplicate sequence name 'chrM' in AllRef.fa. You can quickly work around this problem by downgrading your novocraft to version V3.07.00 or earlier.

Starting with Novocraft V3.07.01, novoindex checks for duplicate sequence names and stops the process when it finds one.
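To see which names are duplicated before running novoindex, a check along these lines may help. It is a rough sketch, assuming Python 3 and assuming that a sequence name is the first whitespace-delimited word after `>` in a FASTA header; the exact rule novoindex applies is not stated in this thread, and the function names are hypothetical.

```python
from collections import Counter


def fasta_names(lines):
    """Yield the first word after '>' from each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def duplicate_names(lines):
    """Return the sorted list of sequence names that occur more than once."""
    counts = Counter(fasta_names(lines))
    return sorted(name for name, count in counts.items() if count > 1)
```

For a real file, pass an open handle, for example `duplicate_names(open("AllRef.fa"))`. The file is read line by line, so memory use depends on the number of sequence names rather than on the size of the FASTA file.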

We will fix this in a future commit.

Thanks!

tw

archu87 commented 6 years ago

Thanks for your suggestion. I tried two different versions of novocraft:

  1. novocraftV3.07.00.Linux2.6.tar.gz
  2. novocraftV3.06.05.Linux3.0.tar.gz

Now I am getting an error like this:

    novoindex /home/archana87/NCLscan-1.6/AllRef.ndx /home/archana87/NCLscan-1.6/AllRef.fa

    Creating 4 indexing threads.

    Building with 14-mer and step of 2 bp.

    Interrupted...11

    Obtained 4 stack frames.

    [0x400e76] [0x400efc] [0x4912bf] [0xffffffffff600000]

Can you please tell me the specific version of novocraft with which you tested NCLscan? Any help is much appreciated. Thanks

chiangtw commented 6 years ago

Hi, Was novoindex interrupted immediately? Maybe you could try novoindex on an empty file first.

The following are the versions I have tested on:

and also:

Thanks!

tw

archu87 commented 6 years ago

Hi, Thanks for your support.

I think it is a license issue now (I am using a trial license), because when I tested novoalign it gave the same error with both of these versions: novocraftV3.07.00.Linux2.6.tar.gz and novocraftV3.06.05.Linux3.0.tar.gz.

But when I tested with the previous version, it worked. Could you please tell me whether I can use the same license for all versions of novocraft?

Or could you please fix the bug for the latest version of novocraft?

You have helped me a lot in correcting the errors. Thanks

chiangtw commented 6 years ago

Hi, With the newly uploaded "bin/create_reference.py", the error should be fixed.

Thanks!

tw

archu87 commented 6 years ago

Hi, I tried the newly uploaded create_reference.py file, and now I am getting the following error after running:

    ./NCLscan.py -c new_NCLscan.config -pj test_NCLscan -o output --fq1 simu_5X_100PE_1.fastq --fq2 simu_5X_100PE_1.fastq

Reading annotations on chr22.

Reading annotations on chrX.

Reading annotations on chrY.

Reading annotations on chrM.

Read 57820 genes, 196520 transcripts and 1196293 exons from the gtf file.

novoindex (3.9) - Universal k-mer index constructor.

(C) 2008 - 2011 NovoCraft Technologies Sdn Bhd

novoindex output/test_NCLscan.JS2.ndx output/test_NCLscan.JS2.fa

Creating 4 indexing threads.

Building with 9-mer and step of 1 bp.

novoindex construction dT = 0.0s

Index memory size 0.001Gbyte.

Done.

novoalign (AVX2) (V3.09.00 - Build Jun 12 2018 @ 08:54:05) - A short read aligner with qualities.

(C) 2008-2016 Novocraft Technologies Sdn Bhd.

License file: /home/archana87/NCLscan-1.6/novocraft/novoalign.lic

Licensed to University of Saskatchewan

novoalign -r A 1 -t 0,1 -d output/test_NCLscan.JS2.ndx -f output/test_NCLscan.main.unmapped_1.fastq output/test_NCLscan.main.unmapped_2.fastq --3Prime -o SAM

Starting at Sat Jul 7 16:11:33 2018

Interpreting input files as Sanger FASTQ.

    Error: output/test_NCLscan.JS2.ndx does not appear to be a valid novoindex. Code 9
    End of file reading 4 bytes
    Total time cost = 0.134050130844 sec
    End of file reading 4 bytes
    Total time cost = 0.0874629020691 sec
    End of file reading 4 bytes
    Total time cost = 0.0724909305573 sec
    End of file reading 4 bytes
    Total time cost = 0.077999830246 sec
    Traceback (most recent call last):
      File "/home/archana87/NCLscan-1.6/bin/Add_read_count.py", line 118, in <module>
        add_read_count(args.result_tmp_file, args.result_sam_file, args.output, args.JSParser_bin)
      File "/home/archana87/NCLscan-1.6/bin/Add_read_count.py", line 13, in add_read_count
        all_junc_read_with_ref = get_junc_read(result_sam_data, JSParser_bin)
      File "/home/archana87/NCLscan-1.6/bin/Add_read_count.py", line 66, in get_junc_read
        junc_read_data = get_read_with_ref(junc_read_sam_data)
      File "/home/archana87/NCLscan-1.6/bin/Add_read_count.py", line 46, in get_read_with_ref
        ref_id = re.sub(".[0-9]*$", "", line[2])
    IndexError: list index out of range
    Traceback (most recent call last):
      File "/home/archana87/NCLscan-1.6/bin/get_gene_name.py", line 91, in <module>
        add_gene_name(args.result_tmp_file, args.gene_anno, args.output)
      File "/home/archana87/NCLscan-1.6/bin/get_gene_name.py", line 8, in add_gene_name
        result_tmp_data = read_TSV(result_tmp_file)
      File "/home/archana87/NCLscan-1.6/bin/get_gene_name.py", line 64, in read_TSV
        with open(tsv_file) as data_reader:
    IOError: [Errno 2] No such file or directory: 'output/test_NCLscan.result.tmp2'
    Traceback (most recent call last):
      File "./NCLscan.py", line 448, in <module>
        NCL_Scan4(config, datasets_list, args.project_name, args.output_dir)
      File "./NCLscan.py", line 255, in NCL_Scan4
        final_tmp = read_TSV("{prefix}.result.tmp3".format(**config_options))
      File "./NCLscan.py", line 279, in read_TSV
        with open(tsv_file) as data_reader:
    IOError: [Errno 2] No such file or directory: 'output/test_NCLscan.result.tmp3'

Could you please suggest a solution, or explain why I am getting this error? Any help is much appreciated. Thanks

chiangtw commented 6 years ago

Hi, Thanks for reporting! I just found out that the output format of bedtools getfasta has changed since v2.26.0, and that causes the pipeline to fail.

To fix this, please update the script bin/AssembleJSeq.py from the latest commit.

Thanks!

tw

archu87 commented 6 years ago

Hi, Thanks. I will try it today, and I hope to get the correct result.

Thanks again for your support.

archu87 commented 6 years ago

error_file_nclscan.txt

Please see the attached file.

Any help is much appreciated. Thanks

archu87 commented 6 years ago

Hi, I am so sorry to disturb you, but I finally found the error in my input file. Now I am sure I will get the result this time.

Thanks

archu87 commented 6 years ago

Hi,

I got the result. Thanks a lot for your continuous support.