Arcadia-Science / 2023-amblyomma-americanum-txome-assembly

MIT License
0 stars 0 forks source link

test methods to prioritize inclusion of isoseq transcripts #14

Closed taylorreiter closed 11 months ago

taylorreiter commented 11 months ago

tldr

PR updates contig selection code to include all isoseq transcripts. If there are multiple per orthogroup, it keeps them all. If there are none in an orthogroup, it selects the best short read assembled contig based on transrate scores. If there are short read and long read, it only keep the long read.

Details

After @elizabethmcd made comments on #5, and while making the roadmap for reads2transcriptome, I realized we're systematically biasing against the inclusion of isoseq transcripts in our pipeline.

By the numbers:

The first approach I'm trying to fix this is to combine all of the short reads, run digital normalization on them to reduce the total size without reducing the diversity, and then run transrate on the isoseq transcriptome using this read set. This would the isoseq data contig scores derived from all of the short reads. This approached failed bc there were too many reads to map (which incidentally, also means that mapping this full read set will fail for the final transcriptome, so we'll need a different approach for assessment, or to run transrate without alignment.)

[ INFO] 2023-09-15 08:49:01 : Calculating contig metrics...
[ INFO] 2023-09-15 08:50:38 : Contig metrics:
[ INFO] 2023-09-15 08:50:38 : -----------------------------------
[ INFO] 2023-09-15 08:50:38 : n seqs                       307541
[ INFO] 2023-09-15 08:50:38 : smallest                         57
[ INFO] 2023-09-15 08:50:38 : largest                       16624
[ INFO] 2023-09-15 08:50:38 : n bases                  1343427230
[ INFO] 2023-09-15 08:50:38 : mean len                    4368.26
[ INFO] 2023-09-15 08:50:38 : n under 200                      61
[ INFO] 2023-09-15 08:50:38 : n over 1k                    306148
[ INFO] 2023-09-15 08:50:38 : n over 10k                     1733
[ INFO] 2023-09-15 08:50:38 : n with orf                   231210
[ INFO] 2023-09-15 08:50:38 : mean orf percent              30.43
[ INFO] 2023-09-15 08:50:38 : n90                            2912
[ INFO] 2023-09-15 08:50:38 : n70                            4007
[ INFO] 2023-09-15 08:50:38 : n50                            4840
[ INFO] 2023-09-15 08:50:38 : n30                            5798
[ INFO] 2023-09-15 08:50:38 : n10                            7450
[ INFO] 2023-09-15 08:50:38 : gc                             0.49
[ INFO] 2023-09-15 08:50:38 : bases n                           0
[ INFO] 2023-09-15 08:50:38 : proportion n                    0.0
[ INFO] 2023-09-15 08:50:38 : Contig metrics done in 97 seconds
[ INFO] 2023-09-15 08:50:38 : Calculating read diagnostics...
[ERROR] 2023-09-15 09:12:49 : Snap failed
Welcome to SNAP version 1.0dev.96.

Failed to decompress BAM file at offset 4995416064
SNAP exited with exit code 1 from line 1930 of file SNAPLib/DataReader.cpp

The next approach i'm going to try and implement is in the python selection script to select the long read contig from any orthogroup that has it. My logic in this comes after talking to a pacbio isoseq expert:

Results from this approach -- busco completeness improved by 6.6%. we do increase duplication, but I think that's ok.

old selection criteria results:

        ---------------------------------------------------
        |Results from dataset arthropoda_odb10             |
        ---------------------------------------------------
        |C:91.1%[S:33.8%,D:57.3%],F:5.3%,M:3.6%,n:1013     |
        |922    Complete BUSCOs (C)                        |
        |342    Complete and single-copy BUSCOs (S)        |
        |580    Complete and duplicated BUSCOs (D)         |
        |54     Fragmented BUSCOs (F)                      |
        |37     Missing BUSCOs (M)                         |
        |1013   Total BUSCO groups searched                |
        ---------------------------------------------------

New selection criteria results:

        C:97.7%[S:19.1%,D:78.6%],F:1.6%,M:0.7%,n:1066

        1042    Complete BUSCOs (C)
        204     Complete and single-copy BUSCOs (S)
        838     Complete and duplicated BUSCOs (D)
        17      Fragmented BUSCOs (F)
        7       Missing BUSCOs (M)
        1066    Total BUSCO groups searched