Methods: The individual transcriptomes of P. australis were grouped into three tissue categories: eye, light organ, and shield. Each tissue category was divided into two subcategories, active and inactive, where the activity level reflects whether or not the firefly was emitting light. Each tissue-state combination had two or three replicate transcriptomes, distributed as follows: active eye tissue (8A, 8B, 8P), inactive eye tissue (10A, 10B, 10P), active light organ tissue (9A, 9B), inactive light organ tissue (19A, 19B), active shield tissue (20A, 20B, 20P), and inactive shield tissue (21A, 21B). All of the following steps were performed on the Sapelo cluster at UGA, with the exception of the ggplot2 plots, which were produced in R.
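The sample layout described above can be written down as a small lookup, which is handy when looping over libraries later. This is only a sketch: the function name `sample_info` and the labels `eye`, `light_organ`, and `shield` are illustrative choices, not names from the original scripts.

```shell
#!/bin/bash
# Print "tissue<TAB>state" for a library ID such as 8A or 19B.
# The number-to-tissue mapping is taken from the Methods paragraph.
sample_info() {
    local id="$1"
    case "${id%[ABP]}" in            # strip the replicate letter (A, B or P)
        8)  printf 'eye\tactive\n' ;;
        10) printf 'eye\tinactive\n' ;;
        9)  printf 'light_organ\tactive\n' ;;
        19) printf 'light_organ\tinactive\n' ;;
        20) printf 'shield\tactive\n' ;;
        21) printf 'shield\tinactive\n' ;;
        *)  echo "unknown sample: $id" >&2; return 1 ;;
    esac
}

# Print a three-column sample sheet: library, tissue, state.
for s in 8A 8B 8P 10A 10B 10P 9A 9B 19A 19B 20A 20B 20P 21A 21B; do
    printf '%s\t%s\n' "$s" "$(sample_info "$s")"
done
```

A sheet like this makes it explicit that the light organ and inactive shield groups have two replicates while the others have three.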

Identifying genes (Run_star.sh)

We used the reference genome of P. pyralis to map the P. australis transcriptomes for gene identification, using STAR v. 2.5.3a. Before any reads can be mapped, STAR needs an index of the reference genome; building this index supports the later gene expression analysis and gene identification. The bash script run_star.sh prepares the reference by combining the P. pyralis genome sequence (FASTA) with its gene annotation (GTF). This was performed on the Sapelo cluster at UGA. After submitting the modified run_star.sh file, the resulting genome index was the reference against which each individual transcriptome of our tissue categories, in the active and inactive states, was mapped. The mapping step (Run_mapping.sh) produces .sam and .bam files, which carry the reads aligned to the reference genome: .sam stands for Sequence Alignment/Map, a text format, and .bam is the same information stored in binary form.

Code:

    #!/bin/bash
    #PBS -N star
    #PBS -q batch
    #PBS -l nodes=1:ppn=4:jlmnode
    #PBS -l walltime=480:00:00
    #PBS -l mem=50gb

    module load star/2.5.3a
    cd $PBS_O_WORKDIR

    # Build the STAR genome index from the P. pyralis FASTA and GTF annotation
    /usr/local/apps/star/2.5.3a/bin/STAR --runThreadN 4 \
        --runMode genomeGenerate \
        --genomeDir /lustre1/dad70505/genome_files/ \
        --genomeFastaFiles /lustre1/dad70505/genome_files/Ppyrl.3.fasta \
        --sjdbGTFfile /lustre1/dad70505/genome_files/PPYR_OGS1.0.gtf \
        --sjdbOverhang 74
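The `--sjdbOverhang 74` value in run_star.sh matches the STAR manual's guidance that the overhang should be the read length minus one, which would correspond to 75 bp reads. The read length is not stated in this thread, so that is an inference. A minimal sketch of how the value could be checked against a FASTQ file follows; the helper name `overhang_from_fastq` and the toy read are made up for illustration.

```shell
#!/bin/bash
# Report (read length - 1) using the first read of a FASTQ file.
overhang_from_fastq() {
    # Line 2 of a FASTQ file holds the bases of the first read.
    awk 'NR == 2 { print length($0) - 1; exit }' "$1"
}

# Toy FASTQ with a single 75-base read, written to a temporary file.
tmp_fastq=$(mktemp)
awk 'BEGIN {
    for (i = 0; i < 15; i++) { bases = bases "ACGTA"; qual = qual "IIIII" }
    print "@toy_read_1"; print bases; print "+"; print qual
}' > "$tmp_fastq"

overhang=$(overhang_from_fastq "$tmp_fastq")
echo "--sjdbOverhang $overhang"
rm -f "$tmp_fastq"
```

On real data the function would be pointed at one of the `*_paired_R1.fastq` files; if the reads were trimmed to variable lengths, the first read alone may not be representative.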

Run_mapping.sh

The genome index produced by run_star.sh was used to map each transcriptome to the reference genome and produce aligned paired-end reads. This process was performed for each individual transcriptome data set, covering the inactive and active states of each of the three tissue categories: eye, light organ, and shield. This was performed on the Sapelo cluster at UGA. The output was again in .sam or .bam format; these files contain only the reads that aligned to the reference genome, including spliced alignments. The mapping script was modified for each specific transcriptome (e.g., 21A, 21B, 10A, 10B, 10P) to be mapped against the reference genome.

Code: Job for 10A R1 and 10A R2

    #!/bin/bash
    #PBS -N star
    #PBS -q batch
    #PBS -l nodes=1:ppn=4:jlmnode
    #PBS -l walltime=24:00:00
    #PBS -l mem=50gb

    module load star/2.5.3a
    cd $PBS_O_WORKDIR

    # Map the paired 10A reads; the output prefix matches the sample ID
    /usr/local/apps/star/2.5.3a/bin/STAR --runThreadN 4 \
        --outFileNamePrefix 10A \
        --outFilterScoreMinOverLread 0 \
        --outFilterMatchNminOverLread 0 \
        --outFilterMatchNmin 0 \
        --quantMode TranscriptomeSAM \
        --genomeDir /lustre1/heyduk/firefly/genome/ \
        --readFilesIn /lustre1/heyduk/firefly/reads/10A_paired_R1.fastq /lustre1/heyduk/firefly/reads/10A_paired_R2.fastq

Job for 21B:

    #!/bin/bash
    #PBS -N star
    #PBS -q batch
    #PBS -l nodes=1:ppn=4:jlmnode
    #PBS -l walltime=24:00:00
    #PBS -l mem=50gb

    module load star/2.5.3a
    cd $PBS_O_WORKDIR

    # Map the paired 21B reads; the two FASTQ paths are separated by a space
    /usr/local/apps/star/2.5.3a/bin/STAR --runThreadN 4 \
        --outFileNamePrefix 21B \
        --outFilterScoreMinOverLread 0 \
        --outFilterMatchNminOverLread 0 \
        --outFilterMatchNmin 0 \
        --quantMode TranscriptomeSAM \
        --genomeDir /lustre1/dad70505/genome_files/ \
        --readFilesIn /lustre1/dad70505/21B_paired_R1.fastq /lustre1/dad70505/21B_paired_R2.fastq
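Because the mapping jobs above differ only in the sample ID, the per-sample command can be generated rather than edited by hand, which avoids mismatches between the output prefix and the read files. The following is a dry-run sketch that only prints the commands. The function name `star_cmd` and the directories passed to it are placeholders; the STAR options are the ones used in the jobs above.

```shell
#!/bin/bash
# Print the STAR mapping command for one sample without running it.
# Arguments: sample ID, genome index directory, directory holding the reads.
star_cmd() {
    local sample="$1" genome_dir="$2" reads_dir="$3"
    printf 'STAR --runThreadN 4 --outFileNamePrefix %s --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 0 --quantMode TranscriptomeSAM --genomeDir %s --readFilesIn %s %s\n' \
        "$sample" "$genome_dir" \
        "$reads_dir/${sample}_paired_R1.fastq" \
        "$reads_dir/${sample}_paired_R2.fastq"
}

# Placeholder paths; substitute the real genome and read locations.
for s in 10A 10B 10P 21A 21B; do
    star_cmd "$s" /path/to/genome_files /path/to/reads
done
```

Each printed line could then be pasted into a copy of the PBS header above, or written to its own job file before submission with qsub.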

Run_RSEM.sh

The run_RSEM.sh script converts the aligned .bam files into readable tables that end with .genes.results. This script allowed us to take the aligned and mapped sequences and estimate how many reads were assigned to each annotated gene of the reference for each transcriptome. This was performed on the Sapelo cluster at UGA. The goal of this script was to quantify the number of transcripts (cDNA fragments) that each gene produces and that align properly to the reference genome. We determined which genes were expressed (regardless of their expression level) by analyzing the counts per million; these values were then placed in matrices with an entry for each of the transcriptomes (21A, 21B, etc.). With these matrices we could statistically compare the replicates of the transcriptomes (21, 20, 10, etc.) to detect differential gene expression between active and inactive tissue.

Code:

    #!/bin/bash
    #PBS -N rsem21B
    #PBS -q batch
    #PBS -l nodes=1:ppn=1:AMD
    #PBS -l walltime=10:00:00
    #PBS -l mem=40gb

    module load rsem/1.3.0
    cd $PBS_O_WORKDIR

    # Quantify expression for 21B from the transcriptome-aligned BAM written by STAR
    rsem-calculate-expression --bam --paired-end \
        /lustre1/dad70505/21BAligned.toTranscriptome.out.bam \
        /lustre1/dad70505/genome_files/Ppyr1.3.genome \
        21B
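The counts-per-million idea mentioned above is simple arithmetic: each gene's count is divided by the library's total count and multiplied by one million. As a rough sketch, assuming the standard RSEM .genes.results layout in which column 5 is expected_count, it could be computed as below. The function name `cpm`, the gene names, and the counts are made up for illustration.

```shell
#!/bin/bash
# Print gene_id and counts per million for one RSEM .genes.results file.
cpm() {
    # The file is read twice: first to sum the library size, then to scale each gene.
    awk -F'\t' '
        NR == FNR { if (FNR > 1) total += $5; next }
        FNR > 1   { printf "%s\t%.1f\n", $1, $5 / total * 1e6 }
    ' "$1" "$1"
}

# Toy results file with made-up counts (10 + 30 + 60 = 100 reads in total).
toy=$(mktemp)
{
    printf 'gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n'
    printf 'geneX\ttxX\t1000\t850.00\t10.00\t0.00\t0.00\n'
    printf 'geneY\ttxY\t1000\t850.00\t30.00\t0.00\t0.00\n'
    printf 'geneZ\ttxZ\t1000\t850.00\t60.00\t0.00\t0.00\n'
} > "$toy"

cpm_out=$(cpm "$toy")
printf '%s\n' "$cpm_out"
rm -f "$toy"
```

A gene could then be called expressed if its CPM exceeds a chosen cutoff; the cutoff used in the original analysis is not given in this thread.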

Run_matrix.sh

The run_matrix.sh script was primarily used to combine the libraries into matrices of counts and TPM that were used for statistical analysis and comparison in R. The matrices put the data into a cleaner format that is compatible with R: they compile the large amount of data (in this case, expression values for many genes across all libraries) into a single table for further analysis. This was performed on the Sapelo cluster at UGA. (What was the output file?) These files were then used alongside the BLAST search against a known genome to look for possible screening pigments that exhibit differential expression between inactive and active tissues.
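The contents of run_matrix.sh are not shown in this thread, so the following is only a sketch of what such a step does, not the script that was run. It joins one column from several RSEM .genes.results files into a gene-by-sample table, assuming the standard RSEM layout (column 5 is expected_count, column 6 is TPM). The function names `build_matrix` and `toy_results` and all values are hypothetical.

```shell
#!/bin/bash
# build_matrix COLUMN FILE... : one row per gene, one column per input file.
build_matrix() {
    col="$1"; shift
    awk -F'\t' -v col="$col" '
        FNR == 1 {                                 # header line: a new file starts
            nfiles++
            name = FILENAME
            sub(/.*\//, "", name)                  # drop the directory part
            sub(/\.genes\.results$/, "", name)     # drop the RSEM suffix
            sample[nfiles] = name
            next
        }
        {
            if (!($1 in seen)) { seen[$1] = 1; gene[++ngenes] = $1 }
            value[$1, nfiles] = $col
        }
        END {
            printf "gene_id"
            for (i = 1; i <= nfiles; i++) printf "\t%s", sample[i]
            printf "\n"
            for (g = 1; g <= ngenes; g++) {
                printf "%s", gene[g]
                for (i = 1; i <= nfiles; i++)
                    printf "\t%s", (((gene[g], i) in value) ? value[gene[g], i] : "NA")
                printf "\n"
            }
        }' "$@"
}

# Write a toy .genes.results table with two genes and the given counts.
toy_results() {
    printf 'gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n'
    printf 'geneX\ttxX\t1000\t850.00\t%s\t5.00\t4.00\n' "$1"
    printf 'geneY\ttxY\t2000\t1850.00\t%s\t0.00\t0.00\n' "$2"
}

tmp=$(mktemp -d)
toy_results 10.00 0.00 > "$tmp/21A.genes.results"
toy_results 20.00 3.00 > "$tmp/21B.genes.results"
matrix=$(build_matrix 5 "$tmp/21A.genes.results" "$tmp/21B.genes.results")
printf '%s\n' "$matrix"
rm -rf "$tmp"
```

Passing 6 instead of 5 as the first argument would give a TPM matrix; the resulting tab-separated table can be read in R with read.delim.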

Run_blast.sh

The run_blast.sh script was used to compare the alignment between two sets of sequences: in this case, we blasted the P. australis transcriptomes (the subject sequences) against the Drosophila reference (the query sequences). This was performed on the Sapelo cluster at UGA. (What was the output file?) Comparing the alignment between these sequences was important because it showed which genes could encode possible screening pigments that should be present in the genome of P. australis. These candidate pigments are known from Drosophila, where the wavelength and color that each pigment corresponds to are already known. We used these output files to create comparisons between the tissue categories to see whether there is any differential expression between the active and inactive states of each tissue.

Code: Run_blast.sh

This is for combining the two fasta files:

    cat 8850_Photinus_australis_LW.fasta.txt
    ls
    cat 8850_Photinus_australis_LW.fasta.txt 8850_Photinus_australis_UV.fasta.txt > 8850_Photinus_australis_LW+UV.fasta.txt
    ls -l
    less 8850_Photinus_australis_LW+UV.fasta.txt
    less 8850_Photinus_australis_UV.fasta.txt

    module load ncbiblast+/2.2.29
    makeblastdb -in 8850_Photinus_australis_LW+UV.fasta.txt -dbtype 'nucl'
    ls -l
    ls -l
    nano run_blast2.sh
    qsub run_blast2.sh

    qsub run_blast2.sh
    cd /lustre1/dad70505/
    cd BLAST
    ls -l
    perl filterBLAST.pl blastoutopsin.txt
    history > 3/20/18.txt
    history > History2.txt

Explanation of the columns of the BLAST tabular output:

qseqid query (e.g., gene) sequence id
sseqid subject (e.g., reference genome) sequence id
pident percentage of identical matches
length alignment length
mismatch number of mismatches
gapopen number of gap openings
qstart start of alignment in query
qend end of alignment in query
sstart start of alignment in subject
send end of alignment in subject
evalue expect value
bitscore bit score
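With the twelve columns in this order, hits can be filtered by position: pident is column 3 and evalue is column 11. The sketch below shows one way to keep only strong hits. It is not the filterBLAST.pl script referenced above, whose contents are not shown here; the function name `filter_hits`, the thresholds, and the three toy hits are made up for illustration.

```shell
#!/bin/bash
# Keep hits with evalue <= MAX_E and pident >= MIN_ID from tabular BLAST output on stdin.
filter_hits() {
    awk -F'\t' -v max_e="$1" -v min_id="$2" \
        '($11 + 0) <= (max_e + 0) && ($3 + 0) >= (min_id + 0)'
}

# Three toy hits: a strong one, one with a weak e-value, one with low identity.
hits=$(
    printf 'opsin_LW\tcontig_1\t92.5\t400\t30\t0\t1\t400\t101\t500\t1e-150\t520\n'
    printf 'opsin_LW\tcontig_2\t91.0\t40\t3\t0\t1\t40\t10\t49\t0.5\t30\n'
    printf 'opsin_UV\tcontig_3\t60.0\t300\t120\t2\t1\t300\t5\t304\t1e-20\t95\n'
)

kept=$(printf '%s\n' "$hits" | filter_hits 1e-10 80)
printf '%s\n' "$kept"
```

Only the first toy hit passes both thresholds here. Suitable cutoffs depend on how divergent the query and subject species are.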

ggplot2 (R)

The final step in determining the expression differences for the suspected pigment genes in P. australis was performed in R. R plotted differential expression for each of the suspected pigment genes identified by the BLAST comparison between the P. australis transcriptomes and the Drosophila reference. R compares two matrices and shows how strongly each gene in each matrix is expressed. For example, when comparing differential expression between the active and inactive states of the light organ, the comparison pools the replicates of the active light organ tissue against the replicate transcriptomes of the inactive light organ tissue. The statistical analysis in R shows by how many standard deviations the two tissue states differ and whether the difference in expression is significant.
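The pooled comparison described above can be illustrated with a toy calculation of the effect size alone: the mean of the active replicates against the mean of the inactive replicates, expressed as a log2 fold change. This sketch does not reproduce the statistical test run in R, which is not specified in this thread. The function name `log2fc`, the pseudocount of 1, and all values are assumptions.

```shell
#!/bin/bash
# Input on stdin: header line, then gene_id, two active and two inactive replicate values.
# Output: gene_id and log2((mean_active + 1) / (mean_inactive + 1)).
log2fc() {
    awk -F'\t' 'NR > 1 {
        active   = ($2 + $3) / 2
        inactive = ($4 + $5) / 2
        # The pseudocount of 1 avoids taking the log of zero.
        printf "%s\t%.3f\n", $1, log((active + 1) / (inactive + 1)) / log(2)
    }'
}

# Toy light organ matrix with made-up values.
fc=$(
    {
        printf 'gene_id\t9A\t9B\t19A\t19B\n'
        printf 'geneX\t30\t32\t7\t7\n'
        printf 'geneY\t5\t5\t5\t5\n'
    } | log2fc
)
printf '%s\n' "$fc"
```

In this toy input geneX comes out four-fold higher in the active state (log2 fold change of 2) and geneY is unchanged. A significance call additionally needs the variation between replicates, which is what the analysis in R provides.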

https://developer.github.com/v3/guides/working-with-comments/

Kathrin Stanger-Hall, PhD Department of Plant Biology 4510 Miller Plant Sciences Building University of Georgia, Athens 30602 ksh@uga.edu Stanger-Hall Lab http://research.franklin.uga.edu/stanger-hall/

On Fri, Aug 24, 2018 at 10:00 AM dad70505 notifications@github.com wrote:



dad70505 / sapelo-training

Sapelo + R scripts and purpose #7
