WGLab / VirTect

Detection of viruses from RNA-Seq on human samples
45 stars 13 forks source link

Updating VirTect Installation and changing VirTect.py script to work with Python 3 and Hisat2 #10

Open JonPitsch314 opened 1 year ago

JonPitsch314 commented 1 year ago

Hello, first of all, I want to thank you for all the work that put into creating this tool. I found a few issues when I tried to install and run it based on the instructions in the Github, so I thought I would submit an issue to show what I found. Please let me know what you think.

Creating Environment

conda create -n virtect
git clone https://github.com/WGLab/VirTect
cd VirTect
conda install -c bioconda cutadapt
conda install -c bioconda hisat2 (TopHat is deprecated: https://ccb.jhu.edu/software/tophat/index.shtml)
conda install -c bioconda bwa
conda install -c bioconda bowtie2
conda install -c bioconda samtools
conda install -c bioconda bedtools
conda install -c bioconda fastqc
conda install -c bioconda igv
conda install -c bioconda bioconductor-rdavidwebservice (DAVID: Visualization tool)

Download Fasta Index

Modified the file paths in the download_fasta_index_v0.0.2.py script:

Changed line 36 from:

os.system('''gunzip human_reference/gencode.v29.2wayconspseudos.gtf.gz''')

To:

os.system('''gunzip human_reference/gencode.v29.annotation.gtf.gz''')

This change was needed because the previous gunzip command was trying to operate on a non-existent file, 'gencode.v29.2wayconspseudos.gtf.gz'.


Changed line 55 from:

os.system('''gunzip human_reference/GRCh37_mapping/gencode.v29lift37.annotation.gtf.gz''')

To:

os.system('''gunzip human_reference/gencode.v29lift37.annotation.gtf.gz''')

This change was needed because the previous gunzip command was trying to access a file in a non-existent directory, 'GRCh37_mapping'.


Now the following command should run without issue:

python download_fasta_index_v02.py -buildver hg38/hg19 -index YES

Create Hisat2 Reference Genome Index files

It is recommended by the TopHat creators to use Hisat2, as it will work with Python 3 and newer versions of the tools used in the VirTect.py script.

To create a hisat2 reference genome index, load the virtect environment and run the following command:

hisat2-build human_reference/GRCh38.p12.genome.fa GRCh38.p12.genome_index

The above command will create a series of files with the .ht2 file extension. A screenshot of mine looks like this:

Screen Shot 2023-03-28 at 1 15 53 PM

Modifying the VirTect.py Script

There were only a few changes that needed to be made to work with Python 3.

  1. There were some print statements that needed parentheses.

  2. The tophat command needed to be replaced with a command that worked with hisat2:

    def alignment(): cmd1='hisat2 -f -x '+hisat_index+' -1 reads_fasta_1.fa -2 reads_fasta_2.fa -S '+out+'/unmapped.sam -p '+n_thread+'' print ('Running ', cmd1) os.system(cmd1) alignment()

  3. However, hisat2 takes fasta files as an input instead of fastq files. So I chose to write a function that precedes the alignment function:

    print("Converting reads to FASTA format")
    def fastq_to_fasta():
       cmd_fa1 = 'sed -n \'1~4s/^@/>/p;2~4p\' '+fq1+' > reads_fasta_1.fa'
       print('Running ', cmd_fa1)
       os.system(cmd_fa1)
       cmd_fa2 = 'sed -n \'1~4s/^@/>/p;2~4p\' '+fq2+' > reads_fasta_2.fa'
       print('Running ', cmd_fa2)
       os.system(cmd_fa2)
    fastq_to_fasta()
  4. Adding an argument for the hisat2 index file:

    parser.add_argument('-hisat_index', '--hisat_index',  required = True, metavar = 'human genome index', type = str, help ='The hisat2 index file of the human reference genome')

You'll see in my version of the VirTect script that I commented out the arguments for -ucsc_gene and -index:

#parser.add_argument('-ucsc_gene', '--gtf',  required = True, metavar = 'gtf', type = str, help ='The input gtf file')

#parser.add_argument('-index', '--index_dir',  required = True, metavar = 'index files', type = str, help ='The directory of index files with hg38 prefix of the fasta file i.e,. index_files_directory/hg38')
  1. Changing Usage Message:

I changed the usage message to show the correct arguments needed to run the script:

usage = """Usage: %prog[-h] -1 read1.fastq -2 read2.fastq -o The output name for
              alignement -ucsc_gene gtf -index index files -index_vir
              virus fasta -hisat_index human genome index [-t Number of threads, default: 8]"""
  1. Lastly, fixing the samtools depth command:

The original command caused an error in the second awk command in this string. There was a missing \ from the command. I changed the samtools depth command to the following:

os.system('''samtools depth '''+out+"/unmapped_aln_sorted.bam"+''' | awk '{if ($3>=5) print $0}' | awk '{ if ($2!=(ploc+1)) {if (ploc!=0){printf("%s %d-%d\\n",$1,s,ploc);}s=$2} ploc=$2; }' > '''+out+"/continuous_region.txt"+'''  ''')

I added the missing \ after the %s %d-d string.

My version of the download_fasta_index_v0.0.2.py script

import os import sys import argparse

prog_name = 'download_fasta_index_v0.0.2'

def main():

parser = argparse.ArgumentParser(description='Downloaded the fasta file and generate the index file', prog = prog_name)

parser.add_argument('-buildver', '--hg',  required = True, default=None, metavar = 'The genome version', type = str, help ='The version of the genome')
parser.add_argument('-index', '--index',  required = False, default="No", metavar = 'created index file', type = str, help ='Create index file')

args = parser.parse_args()
hg=args.hg
index=args.index

os.system('''mkdir -p  human_reference''')

def download_fasta(hg,index):

    if (hg=="hg38"):
        print ("Notice: The downloading the hg38 fasta file from gencode")
        os.system('''wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/GRCh38.p12.genome.fa.gz -P human_reference''')

        print ("Notice: The downloading the gtf file from gencode")
        os.system('''wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz -P human_reference''')

        print ("Notice: Unziping the gzip files")
        os.system('''gunzip  human_reference/GRCh38.p12.genome.fa.gz''')
        os.system('''gunzip human_reference/gencode.v29.annotation.gtf.gz''')
        #os.system('''gunzip human_reference/gencode.v29.2wayconspseudos.gtf.gz''')

        if (index=="YES"):
            print ("Notice: Generating the index file by using bowte 2")

            os.system('''bowtie2-build human_reference/GRCh38.p12.genome.fa human_reference/GRCh38.p12.genome''')
        else:
            print ("Notice: Only download the fasta and gtf file")    

    elif (hg=="hg19"):
        print ("Notice: The downloading the hg19 fasta file from gencode")
        os.system('''wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz -P human_reference''')

        print ("Notice: The downloading the gtf file from gencode")
        os.system('''wget  ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/GRCh37_mapping/gencode.v29lift37.annotation.gtf.gz -P human_reference''')

        print ("Notice: Unziping the gzip files")
        os.system('''gunzip  human_reference/GRCh37.primary_assembly.genome.fa.gz''')
        os.system('''gunzip human_reference/gencode.v29lift37.annotation.gtf.gz''')
        #os.system('''gunzip human_reference/GRCh37_mapping/gencode.v29lift37.annotation.gtf.gz''')

        if (index=="YES"):
            print ("Notice: Generating the index file by using bowte 2")
            os.system('''bowtie2-build human_reference/GRCh37.primary_assembly.genome.fa human_reference/GRCh37.primary_assembly.genome''')

        else:
            print ("Notice: Only downloading the fasta and gtf file")
    else:
        print ("Notice: Please choose your fasta file version such as hg19 or hg38")
download_fasta(hg,index)

if name == 'main': main()

My Version of the VirTect.py script

import sys import argparse import os import subprocess import os.path import fileinput

prog="VirTect.py"

usage = """Usage: %prog[-h] -1 read1.fastq -2 read2.fastq -o The output name for alignement -ucsc_gene gtf -index index files -index_vir virus fasta -hisat_index human genome index [-t Number of threads, default: 8]"""

def main():

parser = argparse.ArgumentParser(description='VirTect: A pipeline for Virus detection')

parser.add_argument('-t', '--n_thread', required = False, metavar = 'Number of threads, default: 8', default = '8', type = str, help ='Number of threads') 
parser.add_argument('-1', '--fq1',  required = True, metavar = 'read1.fastq', type = str, help ='The read 1 of the paired end RNA-seq')

parser.add_argument('-2', '--fq2',  required = True, metavar = 'read2.fastq', type = str, help ='The read 2 of the paired end RNA-seq')

parser.add_argument('-o', '--out',  required = True, metavar = 'The output name for alignement', type = str, help ='Define the output directory to be stored the alignement results')

#parser.add_argument('-ucsc_gene', '--gtf',  required = True, metavar = 'gtf', type = str, help ='The input gtf file')

#parser.add_argument('-index', '--index_dir',  required = True, metavar = 'index files', type = str, help ='The directory of index files with hg38 prefix of the fasta file i.e,. index_files_directory/hg38')

parser.add_argument('-index_vir', '--index_vir',  required = True, metavar = 'virus fasta', type = str, help ='The fasta file of the virus genomes')

parser.add_argument('-hisat_index', '--hisat_index',  required = True, metavar = 'human genome index', type = str, help ='The hisat2 index file of the human reference genome')

parser.add_argument('-d', '--distance', required = True, metavar = 'continuous_distance', type = int, help ='Define the continuous mapping distance of mapping reads to virus genome')

args = parser.parse_args()

fq1 = os.path.abspath(args.fq1)

try:
    #f1=open(fq1,'r')
    os.path.isfile(fq1)
    f1=open(fq1,'r')
except IOError:
    print('Error: There was no Read 1 FASTQ file!')
    sys.exit()

fq2 = os.path.abspath(args.fq2)

try:
    os.path.isfile(fq2)
    f2=open(fq2,'r')

except IOError:
    print('Error: There was no Read 2 FASTQ file!')
    sys.exit()

out = os.path.abspath(args.out)

#gtf = os.path.abspath(args.gtf)

#try:
#    os.path.isfile(gtf)
#    f2=open(gtf,'r')

#except IOError:
#    print('Error: There was no GTF file!')
#    sys.exit()   

#index_dir = os.path.abspath(args.index_dir)

#try:
#    os.path.isfile(index_dir)
   # f4=open('hg38'+'."fa"','r')
#except IOError:
#    print('Error: There was no fasta index directory!')
#    sys.exit()

index_vir = os.path.abspath(args.index_vir)

#try:
 #   os.path.isfile(index_vir)
  #  f4=open(index_vir/viruses_757.fasta,'r')
#except IOError:
 #   print('Error: There was no virus fasta index directory!')
  #  sys.exit()

hisat_index = args.hisat_index

n_thread = args.n_thread

distance = args.distance

print("Converting reads to FASTA format")
def fastq_to_fasta():
    cmd_fa1 = 'sed -n \'1~4s/^@/>/p;2~4p\' '+fq1+' > reads_fasta_1.fa'
    print('Running ', cmd_fa1)
    os.system(cmd_fa1)
    cmd_fa2 = 'sed -n \'1~4s/^@/>/p;2~4p\' '+fq2+' > reads_fasta_2.fa'
    print('Running ', cmd_fa2)
    os.system(cmd_fa2)
fastq_to_fasta()

print ("Aligning by hisat2")
def alignment():
    cmd1='hisat2 -f -x '+hisat_index+' -1 reads_fasta_1.fa -2 reads_fasta_2.fa -S '+out+'/unmapped.sam -p '+n_thread+''
    print ('Running ', cmd1)
    os.system(cmd1)
alignment()

def bam2fastq():
    cmd2 ='samtools sort -n  '+out+'/unmapped.sam  -o '+out+'/unmapped_sorted.bam -@ '+n_thread+'' 
    print ('Running ', cmd2)
    os.system(cmd2)    
    cmd3='bedtools bamtofastq -i  '+out+'/unmapped_sorted.bam -fq  '+out+'/unmapped_sorted_1.fq -fq2  '+out+'/unmapped_sorted_2.fq'    
    print ('Running ', cmd3)
    os.system(cmd3)
bam2fastq()

def bwa_alignment():
    cmd4= 'bwa mem '+index_vir+' -t '+n_thread+'  '+out+'/unmapped_sorted_1.fq '+out+'/unmapped_sorted_2.fq > '+out+'/unmapped_aln.sam'
    print ('Running ', cmd4)
    os.system(cmd4)
bwa_alignment()

def virus_detection():
    cmd5= 'samtools view -Sb -h -@ '+n_thread+' '+out+'/unmapped_aln.sam > '+out+'/unmapped_aln.bam'
    print ('Running ', cmd5)
    os.system(cmd5)

    cmd6= '''samtools view '''+out+"/unmapped_aln.bam"+''' | cut -f3 | sort | uniq -c | awk '{if ($1>=400) print $0}' > '''+out+"/unmapped_viruses_count.txt"+''' '''
    print ('Running ', cmd6)
    os.system(cmd6)
virus_detection() 

def sort():
    cmd7= '''samtools sort '''+out+"/unmapped_aln.bam"+'''  -o '''+out+"/unmapped_aln_sorted.bam"+''' '''
    os.system(cmd7)
sort()

#
#subprocess.call("./continuous_region.sh", shell=True)
os.system('''samtools depth '''+out+"/unmapped_aln_sorted.bam"+''' | awk '{if ($3>=5) print $0}' | awk '{ if ($2!=(ploc+1)) {if (ploc!=0){printf("%s %d-%d\\n",$1,s,ploc);}s=$2} ploc=$2; }' > '''+out+"/continuous_region.txt"+'''  ''')

print ("The continous length")
file =open(out+"/continuous_region.txt", "r")

out_put =open(out+"/Final_continous_region.txt", "w")

if (os.fstat(file.fileno()).st_size) >0:
        for i in file.readlines():
            i1=i.split()[0]
            i2=i.split()[1]
            j1=i2.split("-")
            j2=int(j1[1])-int(j1[0])

            if j2 >= distance:
                j3=i1 + "\t" +  str(j1[0]) + '\t' +  str(j1[1])
                out_put.write('%s\n' % j3)

            else:
                pass
else:
    pass 
out_put.close()

final_output=open(out+"/Final_continous_region.txt",'r')
if (os.fstat(final_output.fileno()).st_size) >0:
    print ("----------------------------------------Note: The sample may have some real virus :(-----------------------------------------------------")
    headers = 'virus transcript_start transcript_end'.split()
    for line in fileinput.input([out+'/Final_continous_region.txt'], inplace=True):
        if fileinput.isfirstline():
            print ('\t'.join(headers))
        print (line.strip())
else:
    print ("----------------------------------------Note: There is no real virus in the sample :)-----------------------------------------------------")

if name == 'main': main()

kaichop commented 1 year ago

Thank you very much for the testing and the thorough explanation of the changes and improvements. Do you mind submit a pull request, so that we can merge it into master branch? On the tophat vs hisat issue, I think an argument that lets users to specify the aligner would be preferred to maintain backward compatibility and comparability with old versions.

JonPitsch314 commented 1 year ago

You're welcome! I'm happy to create a pull request for this. Also, I agree with you about making an argument that lets users specify the aligner that they want to use. I will test a few ideas to create that, then submit the pull request.