--read_group:file functionality working in release version?

HabiBBBaghirov commented 8 months ago

Hi,

First of all, great job developing IsoQuant, it's a very useful piece of software.

I want to confirm: is the --read_group file: functionality currently working as intended in the release version, i.e., the one installed through bioconda? It does not work for me: IsoQuant recognizes the flag, and I know that it processes the grouping file, but it ignores everything in the grouping column and just outputs values aggregated over the entire dataset. The YAML functionality does not work for me either, but there I was at least able to fix it by modifying isoquant.py manually.

andrewprzh commented 8 months ago

Dear @HabiBBBaghirov

Thanks for the feedback and sorry for the delayed response!

Could you send me your command line (or the entire log) as well as a few lines from the grouping file? This functionality should work fine in the release version.

Best Andrey

bpuzek commented 6 months ago

Hi @andrewprzh,

thank you for the great software!

I might be experiencing the same issue as @HabiBBBaghirov, with read groups being recognised but not used for generating a per-group matrix. All count matrices have a feature ID column; some additionally have 'count' and others 'NA' as the second column, with no further columns.

command I used: isoquant.py --threads 24 --fastq reads1.fastq.gz reads2.fastq.gz --reference input/GRCh38.primary_assembly.genome.fa --genedb input/gencode.v45.annotation.db --complete_genedb --data_type nanopore --sqanti_output --read_group file:isoquant_bcodes.txt:0:1:, --output isoquant

log file says: Splitting read group file isoquant_bcodes.txt for better memory consumption

few lines from the grouping file:

819574b1-e2fd-4388-b20f-b434461be7f8,CATGCGGAGTTCCGGC
e0d80be6-0cf1-41a2-8430-d00f128c998c,CAATACGAGACTACCT
d4e4aafd-6acf-419e-bf9c-672aa9ba8691,GTGAGCCAGTACCCTA
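
For readers following along, here is a minimal sketch of how the `--read_group file:isoquant_bcodes.txt:0:1:,` value from the command above could be read against the three sample lines. It is not IsoQuant's own parser: the helper names `parse_read_group_spec` and `load_groups` are hypothetical, and the fallback defaults (columns 0 and 1, tab delimiter) are an assumption rather than something stated in this thread.

```python
import csv
import io


def parse_read_group_spec(spec):
    """Split 'file:PATH:READ_COL:GROUP_COL:DELIM' into its parts.

    Missing trailing fields fall back to 0, 1 and tab (assumed defaults).
    """
    parts = spec.split(":", 4)
    if parts[0] != "file" or len(parts) < 2:
        raise ValueError(f"not a file-based read group spec: {spec!r}")
    path = parts[1]
    read_col = int(parts[2]) if len(parts) > 2 and parts[2] else 0
    group_col = int(parts[3]) if len(parts) > 3 and parts[3] else 1
    delim = parts[4] if len(parts) > 4 and parts[4] else "\t"
    return path, read_col, group_col, delim


def load_groups(handle, read_col, group_col, delim):
    """Return {read_id: group_id} from an open grouping file."""
    groups = {}
    for row in csv.reader(handle, delimiter=delim):
        if len(row) <= max(read_col, group_col):
            continue  # skip blank or short lines
        groups[row[read_col]] = row[group_col]
    return groups


# The three grouping-file lines quoted in the comment above.
sample = """819574b1-e2fd-4388-b20f-b434461be7f8,CATGCGGAGTTCCGGC
e0d80be6-0cf1-41a2-8430-d00f128c998c,CAATACGAGACTACCT
d4e4aafd-6acf-419e-bf9c-672aa9ba8691,GTGAGCCAGTACCCTA
"""

path, read_col, group_col, delim = parse_read_group_spec(
    "file:isoquant_bcodes.txt:0:1:,"
)
groups = load_groups(io.StringIO(sample), read_col, group_col, delim)
print(path, read_col, group_col, repr(delim))
print(groups)
```

Parsed this way, column 0 holds the read ID and column 1 the barcode, so the option string itself is consistent with the file layout shown.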

I hope that this is helpful and that we can find a solution!

Best, Barbara

andrewprzh commented 6 months ago

Dear @bpuzek

Could you send me the log file and the list of files in your output folder?

Best Andrey

bpuzek commented 6 months ago

Dear Andrey,

here are the file list and log files: isoquant.log, alignment.log, filelist.txt

Many thanks in advance, Barbara

andrewprzh commented 6 months ago

@bpuzek

Thanks! Could you also send me the first few lines of OUT.transcript_grouped_counts.tsv? Everything looks intact, the output should be there...

Best Andrey

bpuzek commented 6 months ago

Dear @andrewprzh,

here are the first few lines of the file. There is only one column of counts, instead of one column per cell as expected:

#feature_id     NA
ENST00000003583.12      4.67
ENST00000003912.7       0.00
ENST00000008440.9       12.00
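
A quick way to spot this symptom is to look only at the header of the grouped counts file. The sketch below assumes the file is tab-separated with the feature ID in the first column, as the excerpt above suggests; `group_columns` and `looks_ungrouped` are hypothetical helper names. A lone `NA` column means every read ended up in a single fallback group, which is what turned out to be the case later in this thread.

```python
def group_columns(header_line):
    """Return the group names from a grouped-counts header line."""
    fields = header_line.rstrip("\n").split("\t")
    return fields[1:]  # drop the '#feature_id' column


def looks_ungrouped(header_line):
    """True if the only group column is the 'NA' fallback group."""
    return group_columns(header_line) == ["NA"]


header = "#feature_id\tNA"
print(group_columns(header))
print(looks_ungrouped(header))
```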

Best, Barbara

bpuzek commented 6 months ago

Dear @andrewprzh any news on this?

Thanks, Barbara

andrewprzh commented 6 months ago

@bpuzek

Sorry, this is quite puzzling. I tried to run some of my data providing a similar CSV file with --read_group, and everything works fine...

Would it be possible to send me some data via email? E.g. a small portion of the BAM files (like a thousand reads) and respective barcodes?

Best Andrey

bpuzek commented 6 months ago

Dear @andrewprzh,

While preparing the files to send, I noticed that the read IDs in the read_group file did not match those in the fastq files, and thus all reads were assigned to the 'NA' group. @HabiBBBaghirov maybe this helps.
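
A mismatch like this can be caught before running the pipeline by comparing the two sets of IDs. The following is a minimal sketch, not part of IsoQuant: it assumes four-line FASTQ records and takes the read ID to be the first whitespace-delimited token of the header line (without the leading `@`). The helper names and the toy records are made up for illustration.

```python
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def fastq_read_ids(handle):
    """Yield read IDs from the header line of each four-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            tokens = line.strip()[1:].split()
            if tokens:
                yield tokens[0]


def id_overlap(fastq_ids, grouped_ids):
    """Return (shared, total in FASTQ, total in grouping file)."""
    fastq_ids, grouped_ids = set(fastq_ids), set(grouped_ids)
    return len(fastq_ids & grouped_ids), len(fastq_ids), len(grouped_ids)


# Toy data: 'readB' is listed under a different name in the grouping file.
toy_fastq = io.StringIO("@readA extra=1\nACGT\n+\nIIII\n@readB\nACGT\n+\nIIII\n")
toy_grouping = {"readA": "CELL1", "readB_renamed": "CELL2"}

shared, n_fastq, n_grouped = id_overlap(fastq_read_ids(toy_fastq), toy_grouping)
print(f"{shared} of {n_fastq} FASTQ reads have a group "
      f"({n_grouped} IDs in the grouping file)")
```

For real data, `open_text("reads1.fastq.gz")` could replace the in-memory handle. A shared count of zero would explain a counts file with a single 'NA' column.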

Now I have encountered another problem: after rerunning the pipeline, I expected >2000 read groups in the output files and I only get 918. Each column corresponds to a group, in contrast to what the README says: "In the number of groups exceeds 10, file will contain 3 columns". Did you ever do tests on this many groups?

I noticed that the groups in the count matrix in general have more reads assigned to them than those that are missing from the count matrix. Is there some sort of filter that would remove a group supported by fewer than n reads?

To note, reads belonging to read groups missing from the count matrix are present in corrected_reads.bed, read_assignments.tsv and transcript_model_reads.tsv files.
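
One way to quantify this observation is to list the groups that appear in the grouping file but not in the count matrix header, together with the number of reads each one has. The sketch below assumes a tab-separated matrix header with the feature ID in the first column; `missing_groups` is a hypothetical helper and the data are toy values, not taken from the run described here.

```python
from collections import Counter


def reads_per_group(read_to_group):
    """Count how many reads the grouping file assigns to each group."""
    return Counter(read_to_group.values())


def missing_groups(read_to_group, matrix_header):
    """Return {group: read count} for groups absent from the matrix header."""
    in_matrix = set(matrix_header.rstrip("\n").split("\t")[1:])
    counts = reads_per_group(read_to_group)
    return {g: n for g, n in counts.items() if g not in in_matrix}


# Toy example: CELL3 has a single read and no column in the matrix.
toy_groups = {"r1": "CELL1", "r2": "CELL1", "r3": "CELL2", "r4": "CELL3"}
toy_header = "#feature_id\tCELL1\tCELL2"

print(missing_groups(toy_groups, toy_header))
```

Sorting the result by read count would show whether the missing groups are indeed the sparsely supported ones.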

Thank you so much for all your help! Barbara

andrewprzh commented 6 months ago

Dear @bpuzek

> Each column corresponds to a group, in contrast to what the README says: "In the number of groups exceeds 10, file will contain 3 columns". Did you ever do tests on this many groups?

Thanks for pointing this out, I have now removed it from the README. Each group should have its own column regardless of the number of groups.

> Did you ever do tests on this many groups?

Yes, I ran IsoQuant on single-cell data a few times with ~5K cells. It takes some RAM (especially if --count_exons is set) but works fine.

> Now I encountered another problem - after rerunning the pipeline, I expected >2000 read groups in the output files and I only get 918

Sounds odd, but could it happen that some groups had no assigned reads?

> I noticed that the groups in the count matrix in general have more reads assigned to them than those that are missing from the count matrix. Is there some sort of a filter that would remove a group supported by less than n reads?

There are no filters, even a single unique read should be enough. Are those reads uniquely assigned?
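
To check whether the reads of a missing group are uniquely assigned, one could tally assignment types per group from read_assignments.tsv. This is only a sketch under assumptions: it supposes a tab-separated file with `#` comment lines, the read ID in column 0 and the assignment type in column 5, and the type labels in the toy lines are illustrative. The column positions should be checked against the header of the actual file and passed in accordingly.

```python
from collections import Counter, defaultdict


def assignment_types_per_group(lines, read_to_group, read_col=0, type_col=5):
    """Tally assignment types for each read group.

    read_col and type_col are assumed positions; adjust them to the
    header of the actual read_assignments.tsv.
    """
    per_group = defaultdict(Counter)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(read_col, type_col):
            continue  # skip malformed lines
        group = read_to_group.get(fields[read_col], "NA")
        per_group[group][fields[type_col]] += 1
    return per_group


# Toy input with hypothetical reads, transcripts and type labels.
toy_lines = [
    "#read_id\tchr\tstrand\tisoform_id\tgene_id\tassignment_type\n",
    "readA\tchr1\t+\tT1\tG1\tunique\n",
    "readB\tchr1\t+\tT1\tG1\tambiguous\n",
    "readC\tchr1\t+\tT2\tG1\tambiguous\n",
]
toy_groups = {"readA": "CELL1", "readB": "CELL2", "readC": "CELL2"}

tally = assignment_types_per_group(toy_lines, toy_groups)
for group, types in sorted(tally.items()):
    print(group, dict(types))
```

A group whose tally contains no uniquely assigned reads would be a candidate explanation for a missing column.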

I wonder if I could get a subsample of your data to try it out. By the way, have you tried the new IsoQuant 3.4? Does the issue reproduce there?

Best Andrey

ablab / IsoQuant

--read_group:file functionality working in release version? #167