test methods to prioritize inclusion of isoseq transcripts

tldr

PR updates contig selection code to include all isoseq transcripts. If there are multiple per orthogroup, it keeps them all. If there are none in an orthogroup, it selects the best short read assembled contig based on transrate scores. If there are short read and long read, it only keep the long read.

Details

After @elizabethmcd made comments on #5, and while making the roadmap for reads2transcriptome, I realized we're systematically biasing against the inclusion of isoseq transcripts in our pipeline.

In the first iteration of orthofuser, I did not run transrate on the isoseq transcriptome bc there is no single natural short read set to use.
This means that none of the isoseq contigs went into the contig selection step with a score
...which resulted in the only isoseq contigs being selected being those in orthogroups that only have isoseq contigs in them

By the numbers:

Original isoseq data has 307,541 contigs
Deduplicated input transcriptomes has 283,017 contigs
Orthofuser transcriptome has 16,553 contigs
Decontaminated transcriptome has 16,512 contigs

The first approach I'm trying to fix this is to combine all of the short reads, run digital normalization on them to reduce the total size without reducing the diversity, and then run transrate on the isoseq transcriptome using this read set. This would the isoseq data contig scores derived from all of the short reads. This approached failed bc there were too many reads to map (which incidentally, also means that mapping this full read set will fail for the final transcriptome, so we'll need a different approach for assessment, or to run transrate without alignment.)

[ INFO] 2023-09-15 08:49:01 : Calculating contig metrics...
[ INFO] 2023-09-15 08:50:38 : Contig metrics:
[ INFO] 2023-09-15 08:50:38 : -----------------------------------
[ INFO] 2023-09-15 08:50:38 : n seqs                       307541
[ INFO] 2023-09-15 08:50:38 : smallest                         57
[ INFO] 2023-09-15 08:50:38 : largest                       16624
[ INFO] 2023-09-15 08:50:38 : n bases                  1343427230
[ INFO] 2023-09-15 08:50:38 : mean len                    4368.26
[ INFO] 2023-09-15 08:50:38 : n under 200                      61
[ INFO] 2023-09-15 08:50:38 : n over 1k                    306148
[ INFO] 2023-09-15 08:50:38 : n over 10k                     1733
[ INFO] 2023-09-15 08:50:38 : n with orf                   231210
[ INFO] 2023-09-15 08:50:38 : mean orf percent              30.43
[ INFO] 2023-09-15 08:50:38 : n90                            2912
[ INFO] 2023-09-15 08:50:38 : n70                            4007
[ INFO] 2023-09-15 08:50:38 : n50                            4840
[ INFO] 2023-09-15 08:50:38 : n30                            5798
[ INFO] 2023-09-15 08:50:38 : n10                            7450
[ INFO] 2023-09-15 08:50:38 : gc                             0.49
[ INFO] 2023-09-15 08:50:38 : bases n                           0
[ INFO] 2023-09-15 08:50:38 : proportion n                    0.0
[ INFO] 2023-09-15 08:50:38 : Contig metrics done in 97 seconds
[ INFO] 2023-09-15 08:50:38 : Calculating read diagnostics...
[ERROR] 2023-09-15 09:12:49 : Snap failed
Welcome to SNAP version 1.0dev.96.

Failed to decompress BAM file at offset 4995416064
SNAP exited with exit code 1 from line 1930 of file SNAPLib/DataReader.cpp

The next approach i'm going to try and implement is in the python selection script to select the long read contig from any orthogroup that has it. My logic in this comes after talking to a pacbio isoseq expert:

the isoseq transcripts will always faithfully represent the full length of the input cDNA because RNA loves to degrade, especially at the 5' end, the transcripts themselves might be fragments.
the pacbio isoseq data analysis pipeline auto-merges 5' degraded transcripts to longer length parent transcripts if those longer length parents exist in the data. this means that short sequences are either
- fragments that will disappear in analysis bc they can be detected as such (this is already run on our isoseq input data)
- low abundance degraded transcripts
- real
So the risk is that we will select a low abundance degraded transcript instead of a better short read transcript. I think for now we just need to accept that risk and move on 🤷‍♀️ the alternatives are:
- re-writing the transrate tool, which doesn't sound appealing
- checking contig length in each orthogroup or other sophisticated comparisons to see if we should retain the isoseq transcript.

Results from this approach -- busco completeness improved by 6.6%. we do increase duplication, but I think that's ok.

old selection criteria results:

        ---------------------------------------------------
        |Results from dataset arthropoda_odb10             |
        ---------------------------------------------------
        |C:91.1%[S:33.8%,D:57.3%],F:5.3%,M:3.6%,n:1013     |
        |922    Complete BUSCOs (C)                        |
        |342    Complete and single-copy BUSCOs (S)        |
        |580    Complete and duplicated BUSCOs (D)         |
        |54     Fragmented BUSCOs (F)                      |
        |37     Missing BUSCOs (M)                         |
        |1013   Total BUSCO groups searched                |
        ---------------------------------------------------

New selection criteria results:

        C:97.7%[S:19.1%,D:78.6%],F:1.6%,M:0.7%,n:1066

        1042    Complete BUSCOs (C)
        204     Complete and single-copy BUSCOs (S)
        838     Complete and duplicated BUSCOs (D)
        17      Fragmented BUSCOs (F)
        7       Missing BUSCOs (M)
        1066    Total BUSCO groups searched

Arcadia-Science / 2023-amblyomma-americanum-txome-assembly

test methods to prioritize inclusion of isoseq transcripts #14

tldr

Details