Mariam Oweda, Mohamed Nofal ID: 191057, Mohamed Emam ID: 1910038, Nouran Tantawy ID: 181045, Hadder Hassan ID: 191051
RNA sequencing (RNA-seq) can answer a wide range of biological questions, and its use has generated huge amounts of gene expression data across fields such as biology and medicine. Differential expression analysis of RNA-seq data has become one of the most frequently used analyses for understanding cellular processes in biological and biomedical research, as well as for discovering diagnostic markers for diseases. However, one of the crucial steps for obtaining meaningful differential expression results is the precise mapping of each read to its transcript. With advances in NGS technologies, different software packages have been developed to overcome mapping problems such as repeats and pseudogenes while offering competitive performance and accuracy. From this point, we will test the concordance of RNA-seq analysis output among five mapping tools: three alignment-based tools (HISAT2, STAR, and the recently developed Magic-BLAST, which does not build an index of the genome but instead builds an index of a batch of reads and scans a BLAST database for potential matches) and two alignment-free tools (kallisto and Salmon). Each will be combined with DESeq2, one of the most widely used programs for differential gene expression analysis in RNA-seq experiments. We will use a publicly available RNA-seq dataset of 64 paired-end Illumina hepatocellular carcinoma samples that correlates with survival. The samples were retrospectively derived from hepatocellular carcinoma tissue as well as non-tumor tissue from the livers of the same patients. We will investigate the differences in aligner performance by comparing the DESeq2 list of differentially expressed genes obtained with each aligner, and we will validate the accuracy of the results against published wet-lab literature. As transcriptomic analysis becomes an important tool of precision medicine, the choice of bioinformatics software is a critical step for clinical research.
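One possible way to quantify the concordance between the per-aligner DESeq2 gene lists is a pairwise Jaccard index, sketched below in Python. This is only an illustration under stated assumptions: the abstract does not specify a concordance metric, the helper names `jaccard` and `pairwise_concordance` are hypothetical, and the gene identifiers are placeholders rather than real results.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty lists are treated as fully concordant (an assumption).
    return len(a & b) / len(union) if union else 1.0


def pairwise_concordance(de_genes):
    """Map each pair of tool names to the Jaccard index of their DE gene lists."""
    return {
        (x, y): jaccard(de_genes[x], de_genes[y])
        for x, y in combinations(sorted(de_genes), 2)
    }


# Placeholder gene sets, NOT real DESeq2 output.
de_genes = {
    "HISAT2": {"GENE_A", "GENE_B", "GENE_C"},
    "STAR": {"GENE_A", "GENE_B", "GENE_D"},
    "Salmon": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
}

concordance = pairwise_concordance(de_genes)

# Genes reported as differentially expressed by every tool.
shared_by_all = set.intersection(*de_genes.values())

for pair, score in concordance.items():
    print(pair, round(score, 2))
print("shared by all tools:", sorted(shared_by_all))
```

In practice, the input sets would be the significant genes exported from each DESeq2 run (for example, after filtering on adjusted p-value), and the genes shared by all tools could serve as a starting point for the literature-based validation.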
Sequence counts for each sample. Duplicate read counts are an estimate only.
The mean quality value across each base position in the read.
The number of reads with average quality scores. Shows if a subset of reads has poor quality.
The proportion of each base position for which each of the four normal DNA bases has been called.
The average GC content of reads. A normal random library typically has a roughly normal distribution of GC content.
The percentage of base calls at each position for which an N was called.
The relative level of duplication found for every sequence.
The cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position.
Status for each FastQC section showing whether results seem entirely normal (green), slightly abnormal (orange) or very unusual (red).
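To make a few of the metrics described above concrete, the following Python sketch computes the mean quality and the percentage of N calls at each read position, plus the GC content of a single read. It is a simplified illustration, not FastQC's actual implementation: it assumes Phred+33 quality encoding, the function names `per_position_stats` and `gc_percent` are hypothetical, and the two reads are made-up examples.

```python
def per_position_stats(reads):
    """Compute mean quality and percent N at each position.

    `reads` is a list of (sequence, quality_string) tuples, with qualities
    assumed to be Phred+33 encoded.
    """
    max_len = max(len(seq) for seq, _ in reads)
    stats = []
    for i in range(max_len):
        # Only reads long enough to cover position i contribute.
        bases = [seq[i] for seq, _ in reads if len(seq) > i]
        quals = [ord(qual[i]) - 33 for _, qual in reads if len(qual) > i]
        stats.append({
            "mean_quality": sum(quals) / len(quals),
            "percent_n": 100 * bases.count("N") / len(bases),
        })
    return stats


def gc_percent(seq):
    """Percentage of G and C bases over the full read length."""
    if not seq:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


# Two made-up reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
reads = [("ACGT", "IIII"), ("NCGA", "!III")]

stats = per_position_stats(reads)
print(stats[0])  # position 1: one N out of two calls, qualities 40 and 0
print(gc_percent("ACGT"))
```

Real FASTQ files would normally be parsed with an established library rather than hand-built tuples; the tuples here only keep the sketch self-contained.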