Open yeredh opened 4 years ago
I am not the author, but I solved this by manually editing the code.
transcript_lines <- c()
for (k in 1:length(Identified_viral_fragments)) {
transcript_line = c(Identified_viral_fragments[k],'Empty_Annotation','transcript',1,Length_segments[k],1000,".",".",
paste("gene_id ",Identified_viral_fragments[k],"_1;",""))
transcript_lines <- c(transcript_lines,paste0(transcript_line,collapse="\t"))
}
writeLines(transcript_lines,con=Path_GTF)
Oh, and the other issue you may run into is that the chromosome names in the STAR index contain the full name (e.g. "refseq|NC_022518|9472nt|Human"), while the viral annotation file will have e.g. NC_022518. Here are the changes I made to the code to address this if you have run into the same issue.
Viral_annotation <- read.table(Viral_annotation_file,header=TRUE,sep="\t",check.names=FALSE,comment.char="",quote="")
Identified_viral_fragments_for_subset <- sapply(strsplit(Identified_viral_fragments,"\\|"),"[[",2)
Length_segments <- as.numeric(Viral_annotation$Genome_length[match(Identified_viral_fragments_for_subset,Viral_annotation$Name_sequence)])
And then from here, go straight into creating the transcript_lines vector, running the for loop, and outputting the new GTF file.
In case it helps anyone else, I could not get past the gtf file step even after trying the above (not sure if it has anything to do with the R version, I am using R 4). Finally I switched the annotation file format to the SAF option (this example code is only for processing sars-cov-2 detected reads):
saf_df <- data.frame(
"GeneID"=c("NC_045512","NC_039199"),
"Chr"=c("refseq|NC_045512|29903nt|Severe","refseq|NC_039199|13350nt|Human"),
"Start"=c(1,1),
"End"=c(29903,13350),
"Strand"=c(".","."),
stringsAsFactors=F
)
#Assigning reads to transcripts using Rsubread Featurecounts
x = suppressMessages(
featureCounts(
files = paste(path_temp,"/Reads_to_demultiplex.bam",sep=""),
annot.ext=saf_df,
reportReads = "BAM",reportReadsPath = path_temp,
verbose = F,primaryOnly = T,allowMultiOverlap = T
)
)
I am not the author, but I solved this by manually editing the code.
transcript_lines <- c() for (k in 1:length(Identified_viral_fragments)) { transcript_line = c(Identified_viral_fragments[k],'Empty_Annotation','transcript',1,Length_segments[k],1000,".",".", paste("gene_id ",Identified_viral_fragments[k],"_1;","")) transcript_lines <- c(transcript_lines,paste0(transcript_line,collapse="\t")) } writeLines(transcript_lines,con=Path_GTF)
Oh, and the other issue you may run into is that the chromosome names in the STAR index contain the full name (e.g. "refseq|NC_022518|9472nt|Human"), while the viral annotation file will have e.g. NC_022518. Here are the changes I made to the code to address this if you have run into the same issue.
Viral_annotation <- read.table(Viral_annotation_file,header=TRUE,sep="\t",check.names=FALSE,comment.char="",quote="") Identified_viral_fragments_for_subset <- sapply(strsplit(Identified_viral_fragments,"\\|"),"[[",2) Length_segments <- as.numeric(Viral_annotation$Genome_length[match(Identified_viral_fragments_for_subset,Viral_annotation$Name_sequence)])
And then from here, go straight into creating the transcript_lines vector, running the for loop, and outputting the new GTF file.
Thanks, it seems like the author did not add the codes actually creating the pseudo GTF file and your codes could perfectly solve this issue. I just wanna ask how did files named like ""refseq|NC_022518|9472nt|Human"" are created, it did not work on my Ubuntu OS since "|" is not a permitted character in file names. I use awk commands to replace them with other permitted characters.
And in case it may help anyone else, if you encountered problems when exporting viral SAM files with errors of invalid argument, try to change the lines of viral names in the Virusite genome file with this command : awk '{ gsub(/|/, "character allowed"); print %0}' path_to_reference_genome/genomes.fasta > new_reference_genomes.fasta while you can choose any allowed character and separate them by it later.
@heathergeiger Hi Heather. Were you able to run the 2nd script (assembly using StringTie)? It seems that the assembly isn't working for me and I am uncertain why. I have a bam file that has only 99 reads (so quite small) and StringTie is simply not working.
Hello Pierre,
I was able to run Viral-Track successfully. However, I noticed that if you don't run the transcriptome assembly step
Viral_Track_cell_demultiplexing.R
crashes because the basic GTF file is not created (see code below) and thefeatureCounts
call fails because it can't find the GTF fileIn the code below, the transcript lines are not written into a text file to update
Path_GTF
, socrashes because
Path_GTF = paste(Output_directory,Name_run,"Merged_GTF.txt",sep = "/")
does not exist.Thanks again for your time and attention,
Yered