Wendellab / homoeologGeneExpression-Coexpression

Challenges and pitfalls in the use of partitioned gene counts for homoeologous gene expression and co-expression network analyses
GNU General Public License v3.0
Input RNA-seq datasets

LSS location

RNA-seq mapping and homoeolog read estimation

Specifically developed for polyploid systems

Generic mapping tools

The A2D5 transcript sequences, derived from the D5 gene models and the A2-D5 SNP index, were used as the reference (smb://lss.its.iastate.edu/gluster-lss/research/jfw-lab/GenomicResources/pseudogenomes/A2D5.transcripts.fa).

Data analysis in R - workflow and scripts

Ongoing working dir: /work/LAS/jfw-lab/hugj2006/eflen/output/

LSS long term storage dir: /lss/research/jfw-lab/Projects/Eflen/

Step 1. prepare read count tables

Scripts

Output read count tables:

Explanation of other output files

[method] is one of "polycat", "hylite", "rsem", "salmon" or "kallisto".

Step 2. Evaluation of homoeolog read estimation

Because the true subgenome origin of every read in the in silico polyploid (ADs) is known, we built a confusion matrix (TP, TN, FP, FN) to evaluate homoeolog read classification. From it we calculated several metrics: precision, recall, F1 score, MCC and accuracy.
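The metrics listed above follow directly from the four confusion-matrix counts. The snippet below is a minimal Python sketch for illustration only (the repository's own scripts are in R). The function name and the example counts are hypothetical, and which class counts as "positive" (for example, reads truly from At) is an assumption.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    nan = float("nan")
    precision = tp / (tp + fp) if (tp + fp) else nan
    recall = tp / (tp + fn) if (tp + fn) else nan
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else nan
    # Matthews correlation coefficient; undefined when any margin is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else nan
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}

# Hypothetical counts, not taken from the study.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```

MCC uses all four cells of the matrix, so it stays informative when the two subgenomes contribute very different numbers of reads.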

In addition, custom measures of Efficiency and Discrepancy were used to examine how different methods handle ambiguous read alignments (by discarding them or by statistical inference). For PolyCat and HyLiTE, Efficiency measures the proportion of all reads aligned to the diploid reference that can be partitioned; for RSEM, Salmon and Kallisto this measure is approximately 1, because their algorithms resolve ambiguous reads by statistical inference rather than discarding them.
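As an illustration of the Efficiency measure described above, the following Python sketch computes, per gene, the fraction of aligned reads that were partitioned to either homoeolog. The table layout, gene names and counts are hypothetical. Discrepancy is not sketched because its definition is not given here.

```python
def efficiency(at_reads, dt_reads, total_aligned):
    """Proportion of reads aligned to the diploid reference that were partitioned."""
    if total_aligned <= 0:
        raise ValueError("total_aligned must be positive")
    return (at_reads + dt_reads) / total_aligned

# Hypothetical per-gene counts: (At-assigned, Dt-assigned, total aligned).
counts = {
    "geneA": (40, 35, 100),  # 25 ambiguous reads left unpartitioned
    "geneB": (50, 50, 100),  # every aligned read assigned to a homoeolog
}
eff = {gene: efficiency(*c) for gene, c in counts.items()}
```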

Scripts

Explanation of output files

[method] is one of "polycat", "hylite", "rsem", "salmon" or "kallisto".

Step 3. Differential gene expression analysis

We ask how homoeolog read estimation and DE methods together affect the "observed" lists of DE genes (ADs: At vs Dt) relative to the "expected" DE genes between homoeologs (A2 vs D5). Two DE analysis algorithms, DESeq2 and EBSeq, were each tested in conjunction with the five homoeolog read estimation methods (polycat, hylite, rsem, salmon, kallisto). Detection of the "expected" DE genes can be treated as a binary decision problem, which was evaluated with sensitivity (= recall), specificity, precision, F-measure, MCC, ROC curves and AUC.
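Treating DE detection as a binary decision can be sketched as a comparison of gene sets, plus a rank-based AUC. This is an illustrative Python sketch and not the repository's R code. The gene IDs, the scores (imagined here as 1 minus an adjusted p-value) and both function names are hypothetical.

```python
def de_confusion(expected_de, observed_de, all_genes):
    """Confusion counts, with the 'expected' DE genes as the truth set."""
    expected, observed, universe = set(expected_de), set(observed_de), set(all_genes)
    tp = len(expected & observed)
    fp = len(observed - expected)
    fn = len(expected - observed)
    tn = len(universe - expected - observed)
    return tp, tn, fp, fn

def auc(scores, labels):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, is_de in zip(scores, labels) if is_de]
    neg = [s for s, is_de in zip(scores, labels) if not is_de]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
expected = {"g1", "g2", "g3"}   # DE in A2 vs D5 (hypothetical)
observed = {"g1", "g2", "g4"}   # DE in ADs At vs Dt (hypothetical)
tp, tn, fp, fn = de_confusion(expected, observed, genes)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

scores = [0.99, 0.95, 0.40, 0.97, 0.10, 0.30]  # hypothetical DE evidence per gene
labels = [g in expected for g in genes]
area = auc(scores, labels)
```

The pairwise AUC is quadratic in the number of genes, which is fine for a sketch; a sort-based rank formula gives the same value for genome-scale lists.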

Scripts

Explanation of output files

Step 4. Differential gene-pair coexpression analysis

Coexpression between homoeologs and between all possible gene pairs was measured by Pearson's correlation coefficient and then classified by contrasting estimated versus true patterns. This yielded nine classes of differential coexpression (DC) patterns, which were tested for enrichment.
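One way nine classes can arise is a 3 x 3 cross of the correlation state (positive, none, negative) under the true counts against the state under the estimated counts. The Python sketch below illustrates that idea only. The fixed threshold of 0.5, the class labels and the expression vectors are assumptions; a significance test on each correlation could take the place of the threshold.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def state(r, threshold=0.5):
    """Collapse a correlation to '+', '-' or '0' (the threshold is an assumption)."""
    if r >= threshold:
        return "+"
    if r <= -threshold:
        return "-"
    return "0"

def dc_class(r_true, r_est, threshold=0.5):
    """Label a gene pair by its true state and its estimated state."""
    return state(r_true, threshold) + "/" + state(r_est, threshold)

# Hypothetical expression of one homoeolog pair across five samples.
true_at, true_dt = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
est_at, est_dt = [1, 2, 3, 4, 5], [5, 3, 4, 2, 1]
r_true = pearson(true_at, true_dt)
r_est = pearson(est_at, est_dt)
label = dc_class(r_true, r_est)

all_classes = {a + "/" + b for a in "+0-" for b in "+0-"}
```

Pairs on the diagonal ("+/+", "0/0", "-/-") keep their true pattern after read estimation; every other class marks a gained, lost or reversed correlation.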

Scripts

Explanation of output files

Step 5. Coexpression network construction

True and estimated homoeolog read count tables from 10 datasets (five mapping pipelines, each followed by rld or log2rpkm transformation) were used to construct weighted and unweighted coexpression networks. The same set of genes was included in all networks for a fair comparison.
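The difference between weighted and unweighted networks, and the restriction to a shared gene set, can be sketched as follows. This is an illustrative Python sketch, not the WGCNA R code used by the project. The soft-threshold power of 6, the hard cutoff of 0.8, the unsigned adjacency and the toy data are all assumptions.

```python
def common_genes(tables):
    """Genes present in every table, so that all networks share the same nodes."""
    return sorted(set.intersection(*(set(t) for t in tables)))

def weighted_adjacency(cor, power=6):
    """Unsigned soft-threshold adjacency: |r| raised to a power."""
    return [[abs(r) ** power for r in row] for row in cor]

def binary_adjacency(cor, cutoff=0.8):
    """Unweighted network: an edge wherever |r| meets the cutoff, no self-loops."""
    return [[1 if i != j and abs(r) >= cutoff else 0
             for j, r in enumerate(row)]
            for i, row in enumerate(cor)]

# Hypothetical tables keyed by gene ID (values omitted for brevity).
tables = [{"g1": 0, "g2": 0, "g3": 0}, {"g2": 0, "g3": 0, "g4": 0}]
shared = common_genes(tables)

# Hypothetical 3 x 3 correlation matrix.
cor = [[1.0, 0.9, 0.2],
       [0.9, 1.0, -0.85],
       [0.2, -0.85, 1.0]]
w = weighted_adjacency(cor)
b = binary_adjacency(cor)
```

Soft thresholding keeps every pair but shrinks weak correlations toward zero, whereas the hard cutoff discards them outright.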

Scripts

Explanation of output files

WGCNA: [method] is polycat, hylite, rsem, salmon or kallisto; [transformation] is rld or log2rpkm.

Binary networks

Step 6. Assessment of network topology and functional connectivity

Scripts

Explanation of output files

Step 7. Examination of the impact of read ambiguity on performance

The metrics derived from the read assignment, DE, DC and network analyses were correlated with gene groups binned by ambiguity.
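Binning genes by ambiguity and summarising a metric per bin might look like the Python sketch below. How ambiguity is scored is not specified here, so the per-gene score in [0, 1], the bin edges and the metric values are all hypothetical.

```python
from bisect import bisect_left

def bin_means(ambiguity, metric, edges):
    """Mean of a per-gene metric within each ambiguity bin.

    edges are ascending upper bounds; a gene falls in the first bin whose
    upper bound is >= its ambiguity score. Empty bins yield None.
    """
    sums = [0.0] * len(edges)
    counts = [0] * len(edges)
    for gene, score in ambiguity.items():
        idx = min(bisect_left(edges, score), len(edges) - 1)
        sums[idx] += metric[gene]
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Hypothetical ambiguity scores and per-gene accuracy.
ambiguity = {"g1": 0.1, "g2": 0.2, "g3": 0.6, "g4": 0.9}
accuracy = {"g1": 0.98, "g2": 0.96, "g3": 0.80, "g4": 0.60}
means = bin_means(ambiguity, accuracy, edges=[0.25, 0.5, 0.75, 1.0])
```

A metric that falls steadily across the bins would suggest that read ambiguity drives the loss in performance.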

Scripts

Explanation of output files