czbiohub-sf / nf-predictorthologs

*de novo* orthologous gene predictions from bam + bed or fasta/fastq data
MIT License

Only do sourmash search on unaligned hashes #36

Closed olgabot closed 4 years ago

olgabot commented 4 years ago

Not doing the below anymore, moved to https://github.com/czbiohub/nf-predictorthologs/pull/41.

On the way to doing this, I also needed to separate aligned/unaligned hashes and run sourmash search on only one of those. To keep some semblance of sanity and avoid millions of commits per PR, I'm splitting those into separate PRs.

~For differentially expressed hashes or provided hashes, if the CSV given to --csv contains a bam column, then filter the input bams for read ids of sequences containing hashes, then run featureCounts to figure out which of seven categories each hash falls into:~

1. ~Not aligned~
2. ~Aligned and multimapped~
3. ~Aligned and single-mapped, but not in a gene~
4. ~Aligned and single-mapped, in a gene, but no known orthology~
5. ~Aligned and single-mapped, in a 1:1 orthologous gene~
6. ~Aligned and single-mapped, in a 1:many orthologous gene~
7. ~Aligned and single-mapped, in a many:many orthologous gene~
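
For reference, the seven categories above can be sketched as a small decision function. This is only an illustration and not code from this PR: the function name, its arguments, and the category labels are hypothetical, and the orthology type strings are assumed to match the ones in the example orthology counts in this description (`ortholog_one2one`, `ortholog_one2many`, `ortholog_many2many`).

```python
from typing import Optional

# Hypothetical mapping from orthology type strings to category labels.
ORTHOLOGY_CATEGORIES = {
    "ortholog_one2one": "single_mapped_one2one",
    "ortholog_one2many": "single_mapped_one2many",
    "ortholog_many2many": "single_mapped_many2many",
}


def classify_hash(
    aligned: bool,
    multimapped: bool = False,
    in_gene: bool = False,
    orthology: Optional[str] = None,
) -> str:
    """Return one of seven category labels for a hash (illustrative only)."""
    if not aligned:
        return "not_aligned"
    if multimapped:
        return "aligned_multimapped"
    if not in_gene:
        return "single_mapped_not_in_gene"
    # In a gene: anything without a recognized orthology type counts as
    # "no known orthology".
    return ORTHOLOGY_CATEGORIES.get(orthology, "single_mapped_no_known_orthology")
```

For example, `classify_hash(True, in_gene=True, orthology="ortholog_one2one")` falls into the 1:1 orthologous gene category, while `classify_hash(False)` is "not aligned" regardless of the other arguments.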

~### Example featureCounts.txt.summary~

Status  /home/olga/data_sm/tabula-microcebus/analyses/kmermaid/blood_cross_species/protein_ksize7/J7_B000578_B009057_S223.microcebus.Aligned.out.sorted.reads-in-shared-hashes.bam
Assigned    26511
Unassigned_Unmapped 0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 16919
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   3598
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    2625
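
A summary like the one above can be tallied with a few lines of Python. This is a rough sketch, not part of the pipeline: the function name is hypothetical, `example.bam` is a placeholder for the real bam path in the header, and it assumes a single count column per status row.

```python
from typing import Dict


def parse_featurecounts_summary(text: str) -> Dict[str, int]:
    """Parse a single-sample featureCounts summary into {status: count}."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Status <bam path>" header row.
        if len(fields) < 2 or fields[0] == "Status":
            continue
        counts[fields[0]] = int(fields[1])
    return counts


# Non-zero rows copied from the example above; the bam name is a placeholder.
SUMMARY = """Status\texample.bam
Assigned\t26511
Unassigned_Unmapped\t0
Unassigned_MultiMapping\t16919
Unassigned_NoFeatures\t3598
Unassigned_Ambiguity\t2625
"""

counts = parse_featurecounts_summary(SUMMARY)
total = sum(counts.values())
for status, n in counts.items():
    if n:
        # Print each non-empty status with its share of all reads.
        print(f"{status}\t{n}\t{n / total:.1%}")
```

Under the categories in this description, `Unassigned_Unmapped` would roughly correspond to "not aligned", `Unassigned_MultiMapping` to "aligned and multimapped", and `Unassigned_NoFeatures` to "single-mapped, but not in a gene"; that correspondence is my reading of the featureCounts statuses, not something the pipeline enforces.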

~### Example orthology_counts_mqc.txt~

unknown_orthology_ensembl97 3911
ortholog_one2one    38904
not_orthologous 9494
ortholog_one2many   8093
ortholog_many2many  542


olgabot commented 4 years ago

Some benchmarking of ripgrep: https://github.com/czbiohub/nf-predictorthologs/pull/39#issuecomment-620254699 and https://github.com/czbiohub/nf-predictorthologs/pull/39#issuecomment-620261610

olgabot commented 4 years ago

I'm doing autocommits right now because the test data (https://github.com/czbiohub/test-datasets/pull/8) is local to czbiohub machines, so I can't test any of the pipeline locally and need to push to the servers to test. Once the pipeline is semi-working, I will subset the 22 GB bam to just the relevant data so I can test locally.

olgabot commented 4 years ago

This was a bad direction. Killing this PR.