Only do sourmash search on unaligned hashes

olgabot commented 4 years ago

Not doing the below anymore, moved to https://github.com/czbiohub/nf-predictorthologs/pull/41.

On the way to doing this, I also needed to separate aligned/unaligned hashes and run sourmash search on only one of those. To keep some semblance of sanity and avoid millions of commits per PR, I'm splitting those into separate PRs.

~For differentially expressed hashes or provided hashes, if the `--csv` CSV contains a `bam` column, filter the input BAMs for the read IDs of sequences containing those hashes, then run featureCounts to determine which of seven categories each hash falls into:~

~1. Not aligned
2. Aligned and multimapped
3. Aligned and single-mapped, but not in a gene
4. Aligned and single-mapped, in a gene, but no known orthology
5. Aligned and single-mapped, in a 1:1 orthologous gene
6. Aligned and single-mapped, in a 1:many orthologous gene
7. Aligned and single-mapped, in a many:many orthologous gene~
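As a rough illustration of the categorization above (not code from this PR), the seven categories could be derived from a read's featureCounts status plus the orthology type of the gene it was assigned to. The function name `classify_hash`, the category labels, and the `"uncategorized"` fallback are all hypothetical. The status and orthology strings are the ones that appear in the example outputs in this description.

```python
from typing import Optional

# Orthology types as they appear in the example orthology_counts_mqc.txt,
# mapped to hypothetical category labels.
ORTHOLOGY_CATEGORIES = {
    "ortholog_one2one": "in_gene_one2one_ortholog",
    "ortholog_one2many": "in_gene_one2many_ortholog",
    "ortholog_many2many": "in_gene_many2many_ortholog",
}


def classify_hash(status: str, orthology_type: Optional[str] = None) -> str:
    """Map a featureCounts status (and orthology type, if any) to a category."""
    if status == "Unassigned_Unmapped":
        return "not_aligned"
    if status == "Unassigned_MultiMapping":
        return "aligned_multimapped"
    if status == "Unassigned_NoFeatures":
        return "aligned_single_mapped_not_in_gene"
    if status == "Assigned":
        # Anything without a recognized orthology type counts as "no known orthology".
        return ORTHOLOGY_CATEGORIES.get(orthology_type, "in_gene_no_known_orthology")
    # Statuses such as Unassigned_Ambiguity are not covered by the seven categories.
    return "uncategorized"
```

Under this sketch, `classify_hash("Assigned", "ortholog_one2one")` gives `"in_gene_one2one_ortholog"`, and any status outside the seven categories falls through to `"uncategorized"`.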

~### Example featureCounts.txt.summary~

Status  /home/olga/data_sm/tabula-microcebus/analyses/kmermaid/blood_cross_species/protein_ksize7/J7_B000578_B009057_S223.microcebus.Aligned.out.sorted.reads-in-shared-hashes.bam
Assigned    26511
Unassigned_Unmapped 0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 16919
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   3598
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    2625
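For reference, a summary like the one above can be read into a dictionary with a few lines of Python. This is a minimal sketch rather than pipeline code: the function name `parse_featurecounts_summary` is hypothetical, `sample.bam` is a placeholder for the real BAM path in the header, and it assumes a single-sample summary with whitespace-separated columns.

```python
def parse_featurecounts_summary(text: str) -> dict:
    """Return {status: read count} from a single-sample featureCounts summary."""
    counts = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the "Status <bam path>" header row.
        if len(fields) < 2 or fields[0] == "Status":
            continue
        counts[fields[0]] = int(fields[-1])
    return counts


# Counts copied from the example above; the BAM path is a placeholder.
SAMPLE = """\
Status\tsample.bam
Assigned\t26511
Unassigned_Unmapped\t0
Unassigned_MappingQuality\t0
Unassigned_Chimera\t0
Unassigned_FragmentLength\t0
Unassigned_Duplicate\t0
Unassigned_MultiMapping\t16919
Unassigned_Secondary\t0
Unassigned_NonSplit\t0
Unassigned_NoFeatures\t3598
Unassigned_Overlapping_Length\t0
Unassigned_Ambiguity\t2625
"""

counts = parse_featurecounts_summary(SAMPLE)
total = sum(counts.values())
assigned_fraction = counts["Assigned"] / total
```

With these numbers, roughly half of the reads containing shared hashes are assigned to a feature, and multimapping reads make up most of the remainder.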

~### Example orthology_counts_mqc.txt~

unknown_orthology_ensembl97 3911
ortholog_one2one    38904
not_orthologous 9494
ortholog_one2many   8093
ortholog_many2many  542

PR checklist

- [ ] This comment contains a description of changes (with reason)
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] If necessary, also make a PR on the nf-core/predictorthologs branch on the nf-core/test-datasets repo
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker`).
- [ ] Make sure your code lints (`nf-core lint .`).
- [ ] Documentation in docs is updated
- [ ] CHANGELOG.md is updated
- [ ] README.md is updated

Learn more about contributing: CONTRIBUTING.md

olgabot commented 4 years ago

Some benchmarking of ripgrep: https://github.com/czbiohub/nf-predictorthologs/pull/39#issuecomment-620254699 and https://github.com/czbiohub/nf-predictorthologs/pull/39#issuecomment-620261610

olgabot commented 4 years ago

Doing autocommits right now because the test data (https://github.com/czbiohub/test-datasets/pull/8) is local to czbiohub machines, so I can't test any of the pipeline locally and need to push to the servers to test. Once the pipeline is semi-working, I will subset the 22 GB BAM to just the relevant data so I can test locally.

olgabot commented 4 years ago

This was a bad direction... killing this PR.

czbiohub-sf / nf-predictorthologs

Only do sourmash search on unaligned hashes #36
