PierreBSC / Viral-Track

MIT License
56 stars 27 forks source link

Basic GTF not created in `Viral_Track_cell_demultiplexing.R` #10

Open yeredh opened 4 years ago

yeredh commented 4 years ago

Hello Pierre,

I was able to run Viral-Track successfully. However, I noticed that if you don't run the transcriptome assembly step Viral_Track_cell_demultiplexing.R crashes because the basic GTF file is not created (see code below) and the featureCounts call fails because it can't find the GTF file


###Finding the path to the GTF : if none fonmd -> 'pseudo GTF' created

Path_GTF = paste(Output_directory,Name_run,"Merged_GTF.txt",sep = "/")

if (!file.exists(Path_GTF)) {
  cat("No GTF file found. Creating a basic GTF file ")

  #Loading the viral annotation file 
  Viral_annotation = read.delim(Viral_annotation_file)
  Viral_annotation = Viral_annotation[Viral_annotation$Name_sequence!=" ",]

  #Selecting the viral segments 
  Viral_annotation = Viral_annotation[Identified_viral_fragments,]
  Length_segments = as.numeric(Viral_annotation$Genome_length)

  for (k in 1:length(Identified_viral_fragments)) {
    transcript_line = c(Identified_viral_fragments[k],'Empty_Annotation','transcript',1,Length_segments[k],1000,".",".",
                        paste("gene_id ",Identified_viral_fragments[k],"_1;",""))
  }

}

In the code below, the transcript lines are not written into a text file to update Path_GTF, so

  #Assigning reads to transcripts using Rsubread Featurecounts
  x = suppressMessages(featureCounts(files = paste(List_target_path,"/Reads_to_demultiplex.bam",sep=""),
                    annot.ext = Path_GTF ,isGTFAnnotationFile = T,
                    reportReads = "BAM",reportReadsPath = List_target_path,verbose = F,primaryOnly = T,allowMultiOverlap = T))

crashes because Path_GTF = paste(Output_directory,Name_run,"Merged_GTF.txt",sep = "/") does not exist.

Thanks again for your time and attention,

Yered

heathergeiger commented 3 years ago

I am not the author, but I solved this by manually editing the code.

transcript_lines <- c()

for (k in 1:length(Identified_viral_fragments)) {
    transcript_line = c(Identified_viral_fragments[k],'Empty_Annotation','transcript',1,Length_segments[k],1000,".",".",
                        paste("gene_id ",Identified_viral_fragments[k],"_1;",""))
   transcript_lines <- c(transcript_lines,paste0(transcript_line,collapse="\t"))
  }

 writeLines(transcript_lines,con=Path_GTF)

Oh, and the other issue you may run into is that the chromosome names in the STAR index contain the full name (e.g. "refseq|NC_022518|9472nt|Human"), while the viral annotation file will have e.g. NC_022518. Here are the changes I made to the code to address this if you have run into the same issue.

Viral_annotation <- read.table(Viral_annotation_file,header=TRUE,sep="\t",check.names=FALSE,comment.char="",quote="")

Identified_viral_fragments_for_subset <- sapply(strsplit(Identified_viral_fragments,"\\|"),"[[",2)
Length_segments <- as.numeric(Viral_annotation$Genome_length[match(Identified_viral_fragments_for_subset,Viral_annotation$Name_sequence)])

And then from here, go straight into creating the transcript_lines vector, running the for loop, and outputting the new GTF file.

juliabelk commented 3 years ago

In case it helps anyone else, I could not get past the gtf file step even after trying the above (not sure if it has anything to do with the R version, I am using R 4). Finally I switched the annotation file format to the SAF option (this example code is only for processing sars-cov-2 detected reads):

  saf_df <- data.frame(
    "GeneID"=c("NC_045512","NC_039199"),
    "Chr"=c("refseq|NC_045512|29903nt|Severe","refseq|NC_039199|13350nt|Human"),
    "Start"=c(1,1),
    "End"=c(29903,13350),
    "Strand"=c(".","."),
    stringsAsFactors=F
  )

  #Assigning reads to transcripts using Rsubread Featurecounts
  x = suppressMessages(
    featureCounts(
      files = paste(path_temp,"/Reads_to_demultiplex.bam",sep=""),
      annot.ext=saf_df,
      reportReads = "BAM",reportReadsPath = path_temp,
      verbose = F,primaryOnly = T,allowMultiOverlap = T 
    )   
  )
ghost commented 2 years ago

I am not the author, but I solved this by manually editing the code.

transcript_lines <- c()

for (k in 1:length(Identified_viral_fragments)) {
    transcript_line = c(Identified_viral_fragments[k],'Empty_Annotation','transcript',1,Length_segments[k],1000,".",".",
                        paste("gene_id ",Identified_viral_fragments[k],"_1;",""))
   transcript_lines <- c(transcript_lines,paste0(transcript_line,collapse="\t"))
  }

 writeLines(transcript_lines,con=Path_GTF)

Oh, and the other issue you may run into is that the chromosome names in the STAR index contain the full name (e.g. "refseq|NC_022518|9472nt|Human"), while the viral annotation file will have e.g. NC_022518. Here are the changes I made to the code to address this if you have run into the same issue.

Viral_annotation <- read.table(Viral_annotation_file,header=TRUE,sep="\t",check.names=FALSE,comment.char="",quote="")

Identified_viral_fragments_for_subset <- sapply(strsplit(Identified_viral_fragments,"\\|"),"[[",2)
Length_segments <- as.numeric(Viral_annotation$Genome_length[match(Identified_viral_fragments_for_subset,Viral_annotation$Name_sequence)])

And then from here, go straight into creating the transcript_lines vector, running the for loop, and outputting the new GTF file.

Thanks, it seems like the author did not add the codes actually creating the pseudo GTF file and your codes could perfectly solve this issue. I just wanna ask how did files named like ""refseq|NC_022518|9472nt|Human"" are created, it did not work on my Ubuntu OS since "|" is not a permitted character in file names. I use awk commands to replace them with other permitted characters.

ghost commented 2 years ago

And in case it may help anyone else, if you encountered problems when exporting viral SAM files with errors of invalid argument, try to change the lines of viral names in the Virusite genome file with this command : awk '{ gsub(/|/, "character allowed"); print %0}' path_to_reference_genome/genomes.fasta > new_reference_genomes.fasta while you can choose any allowed character and separate them by it later.

TeodoraTockovska commented 6 months ago

@heathergeiger Hi Heather. Were you able to run the 2nd script (assembly using StringTie)? It seems that the assembly isn't working for me and I am uncertain why. I have a bam file that has only 99 reads (so quite small) and StringTie is simply not working.