galantelab / fredy

Tool to identify chimeric events from RNA-Seq data
https://www.biorxiv.org/content/10.1101/2024.04.22.590610v1
GNU General Public License v3.0

alignment error and bed4 for the chimeric step #2

Closed. sangho1130 closed this issue 3 months ago

sangho1130 commented 3 months ago

Hi, Thanks for developing freddie tool! I was trying to apply freddie to our cancer dataset, however, currently I'm experiencing some issues.

1) Our data are Nanopore long reads, and we previously used minimap2 for the alignment. When I try to run freddie star, which uses STARlong to align long reads, I get the following error message in the log file.

<_EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length @43cf7361-87a6-4fac-ab84-0c16836216fb_REV_PS=857_PE=895_AE=916_T=45_X=AAAAAAAAAAAAATTAAGAACTCTCGAGAGCCTAGAAATCAGA_Q=22.91hur0f (...sequence excluded) SOLUTION: fix your fastq file>

I'm wondering whether FREDDIE is compatible with Nanopore data, or whether I should skip the alignment step and use pre-aligned BAM files in this case.

2) I was playing around with some of my short-read data with FREDDIE, but I am stuck at the "chimeric" step. I think the required BED4 file is missing and was not generated in the previous step. All I got from "string" were <sample~.freddie.gtf> and the log files. Are there additional commands after running "string"?

Thanks for your help!

rmercuri commented 3 months ago

Hi @sangho1130! Thank you for using FREDDIE.

  1. Our data are Nanopore long reads, and we previously used minimap2 for the alignment. When I try to run freddie star, which uses STARlong to align long reads, I get the following error message in the log file.

<_EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length @43cf7361-87a6-4fac-ab84-0c16836216fb_REV_PS=857_PE=895_AE=916_T=45_X=AAAAAAAAAAAAATTAAGAACTCTCGAGAGCCTAGAAATCAGA_Q=22.91hur0f (...sequence excluded) SOLUTION: fix your fastq file>

  I'm wondering whether FREDDIE is compatible with Nanopore data, or whether I should skip the alignment step and use pre-aligned BAM files in this case.

Regarding this issue in the STAR step: this error usually occurs when the length of a read's quality string does not match the length of its sequence (which can happen for various reasons, such as adapter trimming). However, since you already have the data aligned with minimap2, I don't think it's a problem to skip this step. FREDDIE gives you this option with the -f flag. You could use:

## Create a file with the bam paths to the docker folder
$ ls /path/bam-files/*bam
/path/bam-files/sample1_sorted.bam
/path/bam-files/sample2_sorted.bam

$ ls /path/bam-files/*bam |  awk -F "/" '{print "/home/bams/"$NF}' > $PWD/files.txt

## Example of the resulting file
$ cat $PWD/files.txt
/home/bams/sample1_sorted.bam
/home/bams/sample2_sorted.bam

$ time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/freddie -v /path/bam-files/:/home/bams/ \
    freddie string -o /home/freddie/K562 \
        -a /home/freddie/db/gencode.v36.annotation.gtf \
        -f /home/freddie/files.txt
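
As a side note, if you ever do want to run the STAR step on these reads, it may help to locate the records STARlong is complaining about first. A minimal sketch, assuming standard four-line FASTQ records and a placeholder file name reads.fastq (adjust for your own files), that reports records whose quality string length differs from the sequence length:

## Print the line number of each FASTQ record whose quality string length != sequence length
$ awk 'NR % 4 == 2 { seqlen = length($0) } NR % 4 == 0 && length($0) != seqlen { print "record ending at line " NR }' reads.fastq

## Same check for gzipped reads
$ zcat reads.fastq.gz | awk 'NR % 4 == 2 { seqlen = length($0) } NR % 4 == 0 && length($0) != seqlen { print "record ending at line " NR }'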

Important: once you skip this step and use the BAMs from minimap2, just make sure that your reference files (reference genome and GTF file) follow the same chromosome naming convention as the BAMs (for example, all of them either with or without the "chr" prefix).
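
If you want to double-check that, here is a small sketch (it assumes samtools is installed and reuses the example paths from above):

## Chromosome names in the BAM header
$ samtools view -H /path/bam-files/sample1_sorted.bam | grep '^@SQ' | cut -f 2 | sed 's/^SN://' | sort -u | head

## Chromosome names in the GTF
$ grep -v '^#' db/gencode.v36.annotation.gtf | cut -f 1 | sort -u | head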

  2. I was playing around with some of my short-read data with FREDDIE, but I am stuck at the "chimeric" step. I think the required BED4 file is missing and was not generated in the previous step. All I got from "string" were <sample~.freddie.gtf> and the log files. Are there additional commands after running "string"?

Regarding this point, it depends on what you want to identify. The BED4 file must be provided by the user and should contain, in BED4 format, the events they want to find. In the Practical workflow we use retrocopies from our RCPedia database, but if you are interested in another retroelement, you can use the BED file from RepeatMasker.

I have included a link to download the zipped BED file with all the elements; it is important that you filter the elements of interest and format them as required by the tool. For the file you can download from that link, this is how you would process it:

$ gunzip rmsk.bed.gz
$ head rmsk.bed
chr1    67108753    67109046    L1P5    1892    +
chr1    8388315 8388618 AluY    2582    -
chr1    25165803    25166380    L1MB5   4085    +
chr1    33554185    33554483    AluSc   2285    -
chr1    41942894    41943205    AluY    2451    -
chr1    50331336    50332274    HAL1    1587    +
chr1    58719764    58720546    L2a 1393    +
chr1    75496057    75497775    L1MA9   5372    +
chr1    92274205    92275925    L2  536 +
chr1    100662981   100669120   L1PA4   25118   -

$ fgrep Alu rmsk.bed | cut -f 1-4 | sort -k1,1 -k2,2n | head
chr1    26790   27053   AluSp
chr1    31435   31733   AluJo
chr1    33465   33509   Alu
chr1    35366   35499   AluJr
chr1    39623   39924   AluSx
chr1    40628   40729   AluSz6
chr1    51584   51880   AluYj4
chr1    61862   62160   AluSc
chr1    76892   77201   AluSz
chr1    78285   78421   AluJr

$ fgrep Alu rmsk.bed | cut -f 1-4 | sort -k1,1 -k2,2n > rmsk.bed4
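
As an optional sanity check before the chimeric step (just a sketch, not something FREDDIE requires), you can confirm that every line of the final file has exactly four tab-separated columns; the command prints nothing if the file is well formed:

$ awk -F '\t' 'NF != 4 { print "line " NR " has " NF " columns" }' rmsk.bed4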

Thank you once again for getting in touch.

sangho1130 commented 3 months ago

@rmercuri Thank you so much!!

sangho1130 commented 3 months ago

@rmercuri Oh, one last thing if I may: I think the "Databases" section is offline (https://github.com/galantelab/freddie?tab=readme-ov-file#databases). Are there other repositories where I can access those files?

Thanks!

rmercuri commented 3 months ago

@sangho1130 We updated the files and the links changed, but it's working again now. I'm sorry about that!


sangho1130 commented 3 months ago

Thank you so much @rmercuri !

rmercuri commented 3 months ago

No problem! If you need any help with the pipeline, feel free to contact me via email or here. Additionally, if possible, please share a review of your experience using the tool with your data (did it meet your expectations? haha).