PierreBSC / Viral-Track

MIT License
54 stars 27 forks source link

Bam error #5

Closed loverend closed 4 years ago

loverend commented 4 years ago

Hi - Thanks for the great pipeline. I have been trying to run the first R script having worked through the UMI extraction etc. I am getting the following errors:

module load STAR/2.7.1a-foss-2018b module load UMI-tools/1.0.0-foss-2018b-Python-3.6.6 module load StringTie/2.1.0-foss-2018b module load SAMtools/0.1.20-foss-2018b module load R-bundle-Bioconductor/3.9-foss-2018b-R-3.6.0

Rscript VIRUS/VIRAL_TRACK/Viral_Track_scanning.R /well/immune-rVIRUS/VIRAL_TRACK/Parameters.txt VIRUS/VIRAL_TRACK/Files_to_process.txt

Loading of the libraries.... ... done ! Parameter loading and checkingParameter loading and checkingSTARSecond step : Target Fastq file identification2 Fastq files are going to be processed ! Third step : Mapping by itselfFor each Fastq file a sub-directory is createdMapping SRR8427168_S1_L001_R1_001_extracted.fastq file Mapping ofSRR8427168_S1_L001_R1_001_extracted.fastq done ! Mapping SRR8291066_S1_L001_R1_001_extracted.fastq file Mapping ofSRR8291066_S1_L001_R1_001_extracted.fastq done ! All fastq files have been mapped successfully Starting the BAM file analysis [bam_header_read] EOF marker is absent. The input is probably truncated. [bam_header_read] invalid BAM binary header (this is not a BAM file). [bam_index_core] Invalid BAM header.[bam_index_build2] fail to index the BAM file. Indexing of the bam file for SRR8427168_S1_L001_R1_001_extracted is done [bam_header_read] EOF marker is absent. The input is probably truncated. [bam_header_read] invalid BAM binary header (this is not a BAM file). [bam_index_load] fail to load BAM index. [bam_idxstats] fail to load the index. Computing stat file for the bam file for SRR8427168_S1_L001_R1_001_extracted is done Error in read.table(temp_chromosome_count_path, header = F, row.names = 1) : no lines available in input Execution halted

Please could you advise as to whats going on? I wonder if its to do with the reference I built - using an existing genomes.fa file available on our cluster.

Best wishes, Lauren

polojacky commented 4 years ago

Have you tried the newer version of samtools?

PierreBSC commented 4 years ago

Hi Lauren,

There is a potential issue in the log file you sent. It seems that no mapping has really taken place. Can you check if any reasonable BAM file is produced ?

loverend commented 4 years ago

Hi Pierre,

I don't think any mapping really took place as it just seemed to run over this script in 30 seconds and yield errors. Could you advise exactly what version of the different tools you used? I am working on a cluster and they may have installed different modules.

SRR8427168_S1_L001_R1_001_extracted_Aligned.sortedByCoord.out.bam - This file is completely empty.

Thanks for your help, Lauren

PierreBSC commented 4 years ago

Hi Lauren,

Ok so this is probably due to the fact that when you call R, it is not able to load the STAR and samtools module. It is a rather bad situation but it can be solved. First modify the parameter file : set the parameter Load_STAR_module and Load_samtools_module to TRUE. Then you will have to modify the Viral_Track_scanning.R. First find the path to STAR and samtools. To do so type : module load STAR which STAR and module load samtools which samtools It should look something like this for STAR: /opt/ohpc/pub/libs/star/2.7.0e/bin/Linux_x86_64_static Then replace the paths at line 192 and 197 in the script, it should do the job (it worked on a Centos7 cluster).

Hope this will help !

loverend commented 4 years ago

Hi Pierre: Thanks for the speedy reply. I have actually generated a job script for this which loads our builds for the various modules in advance (see below) so I am assuming it will be able to find the modules and run? I did find out we were using SAMTools 0.1.1 and have since updated it: samtools 1.9.

In this scenario will I need to update the R script? Thank you!

module purge module load STAR/2.7.1a-foss-2018b module load UMI-tools/1.0.0-foss-2018b-Python-3.6.6 module load StringTie/2.1.0-foss-2018b module load SAMtools/1.9-foss-2018b module load R-bundle-Bioconductor/3.9-foss-2018b-R-3.6.0

Rscript VIRUS/VIRAL_TRACK/Viral_Track_scanning.R VIRUS/VIRAL_TRACK/Parameters.txt VIRUS/VIRAL_TRACK/Files_to_process.txt

PierreBSC commented 4 years ago

Hi Lauren,

Unfortunately I think you will have to modify the script as I have mentionned otherwise R will not be able to find the modules...

loverend commented 4 years ago

Hi Pierre,

Okay will do!

I actually had a further question - I am really interested in this tool as I have been building my own custom annotations for single cell viral mapping with 10X for certain viruses. This is approach is quite laborious as any overlapping genes, have to be segmented and renamed to prevent reads being lost in the alignment stage so its very much a 'we suspect this virus is present' approach. I had been hoping to pair it with ViralTrack to get the broader overview. However, have you looked at creating a reference with the per /gene virusSite file and got any results with this? Unfortunately, it doesn't seem to have corrected for any overlapping gene issues, and they are not annotated which isn't ideal for viral gene mapping, but I am curious to hear your thoughts on this? And whether there may be any way to integrate a viral-track approach into a cell-ranger pipeline.

Thanks for a great tool, Lauren

PierreBSC commented 4 years ago

Hi Lauren,

Indeed it could be interesting to have such annotation however I have never done it. One possibility would to download the annotation from the NCBI Refseq server.... which I was never able to do ^^ Anyway if you are using shortread, and even 'worst' 3' biased technique, computing the exact expression table does not make sense, especially for viruses with overlapping genes. In that case I would rather work on the coverage itself, especially if the virus has a short genome. On which virus are you working ? Concerning the integration of Viral-Track into CellRanger that would indeed be really interesting but will probably not be done for a while...

Best

Pierre

loverend commented 4 years ago

At the minute I am working on a Human Herpes Virus reference (with a few more lined up)- and have simply allocated a 'coupled-gene' name to any overlapping segments, so any reads mapping here are considered as a potential read for either gene - hopefully to be untangled at a later date in the PhD. This was then run with a human reference through cell-ranger (10x + MIseq). And of course this only works with PolyA viruses unfortunately. This has allowed me to observe viral lytic / latent expression on a single cell basis. However as I said, it took about 3 weeks of manual curation to build the reference so rather impractical on a large scale. Although I am hoping to validate the general viral load etc with your technique and see if we can pick up any cases of co-infection etc.

loverend commented 4 years ago

Hi Pierre, I left it running and while the mapping occurred it seems to terminate with:

BAM sorting: 194516 mapped reads BAM sorting bins genomic start loci: 1 1536 115039 2 1536 115058 3 1536 126054 4 1536 126109 5 1536 126124 6 1536 126147 7 3935 1759 8 5549 5799 9 8269 2656 10 11296 147 11 11296 148 12 11296 161 13 11296 174 14 11988 42700967 15 11988 167788528 16 11989 102402427 17 11990 65499646 18 11991 6538280 19 11991 95467278 20 11992 54440675 21 11994 30706498 22 11995 1771893 23 11995 89562970 24 11996 39204632 25 11998 1106684 26 11998 39433210 27 11998 58394707 28 11999 32358048 29 11999 32504017 30 11999 32854602 31 11999 32877947 32 11999 32892936 33 11999 32916215 34 11999 32916413 35 11999 32916467 36 11999 96263377 37 12000 18484545 38 12002 22906703 39 12003 23919577 40 12003 177120672 41 12004 152847144 42 12005 143281128 43 12006 32659886 44 12006 166929516 45 12007 137514148 46 12009 19376257 47 12009 137110545 48 12011 24788336 Thread #3 end of input stream, nextChar=-1 Completed: thread #7 Completed: thread #0 Completed: thread #3 Completed: thread #2 Completed: thread #5 Completed: thread #1 Joined thread # 1 Joined thread # 2 Joined thread # 3 Completed: thread #4 Joined thread # 4 Joined thread # 5 Completed: thread #6 Joined thread # 6 Joined thread # 7 May 27 21:07:29 ..... started sorting BAM

Then the output bam file created is empty: SRR8437614_S1_L001_R1_001_extracted_Aligned.sortedByCoord.out.bam

(8 nodes x 16GB)

Job error: (presumably as it couldnt write to bam) Warning messages: 1: In dir.create(temp_output_dir) : '/well/immune-rep/users/kvi236/VIRUS/COVID_TEST//SRR8437614_S1_L001_R1_001_extracted' already exists 2: In dir.create(temp_output_dir) : '/well/immune-rep/users/kvi236/VIRUS/COVID_TEST//SRR8437610_S1_L001_R1_001_extracted' already exists samtools index: "/well/immune-rep/users/kvi236/VIRUS/COVID_TEST//SRR8437614_S1_L001_R1_001_extracted//SRR8437614_S1_L001_R1_001_extracted_Aligned.sortedByCoord.out.bam" is in a format that cannot be usefully indexed Error in colnames<-(*tmp*, value = c("N_reads", "N_unique_reads", : attempt to set 'colnames' on an object with less than two dimensions Calls: colnames<- -> colnames<- Execution halted

PierreBSC commented 4 years ago

Hi Lauren,

This is really weird. The size of the BAM file seems really small. What is the size of the initial fastq file ? One possibility would be that you send me a small part of the fastq file so I can test the file on my set up.

loverend commented 4 years ago

Hi - Yes the bam file is 0.kb so nothing has written to it. It seems to stall a the sorting bam file stage. My fastqs are about 39Gb per F or R file. We checked for memory and it only got up to 45Gb of RAM so it doesn't appear to be a memory issue.

PierreBSC commented 4 years ago

Hi Lauren,

If your file is ~40GB and your memory only of 45GB then bad things migth happen. With a file of 14GB (.fastq.gz) I got more than 35 GB stored on my RAM during the processing by STAR. STAR is really fast but at the price of memory. What I would do would be to first sample a small part of the fastq file (~1-2Gb) and then try to map it.

loverend commented 4 years ago

Thanks for the help with this. So it had 128Gb available to it over all the cluster slots but the maximum usage of the script only ran to 45GB before it terminated without writing the bam. I have tried resubmitting it with even more memory available but that didn't seem to be the issue originally. What would you expect to see at the end of the extracted_log.out file? Presumably more than this: (just trying to trouble shoot at which stage its failing) Completed: thread #0 Joined thread # 1 Joined thread # 2 Joined thread # 3 Joined thread # 4 Joined thread # 5 Joined thread # 6 Joined thread # 7 May 28 04:50:28 ..... started sorting BAM

loverend commented 4 years ago

Hi Pierre - we have realised this was most likely a reference build error (we used a human reference already available on the cluster). I rebuild per chromosome and this time it did run - however there was no QC pdf output nor does every sample seem to have run to completion - for two no viral BAM files were made and we are missing the count_chromosome.txt files (despite the fact we know viral reads are present) and for the other all the top hits were non-human viruses and no qc plots ect. Wondered if you had any idea why this might be or how we could try reruning just the QC stages. Thanks

PierreBSC commented 4 years ago

Hi Lauren,

Can you send me the log of the Rscript command, together with the list of the files present in the directory ? Was the mapping sucessful ?

loverend commented 4 years ago

Hi Pierre, would it be easier to send the files over email?

PierreBSC commented 4 years ago

Of course. My email is pierre dot bost at pasteur dot fr