WGLab / DeepRepeat

An accurate repeat detection from Nanopore data using deep learning and image techniques

Seeking guidance to use on GRCh38 genome #1

Open MeiShu00 opened 3 years ago

MeiShu00 commented 3 years ago

Hi,

I would like to seek your help in using DeepRepeat for my nanopore data. After installation, I got this warning after running python DeepRepeat.py Detect: [image]

I also have the following questions:

  1. Would I need to train my own DeepRepeat models if my reads are aligned to GRCh38? If yes, what are the minimum specs required of the computer? It was mentioned that aligned BAM files also need to be generated after basecalling with Albacore v2.3; may I know what tool should be used for alignment?
  2. If I have 2 barcodes (01 and 02), when calling DeepRepeat.py:
    • When calling STR loci with Barcode 01 data, do I set the --f5folder path to the fast5_pass files of BC01?
  3. For a PCR-based run, do I just use the flag --is_pcr in the command line?
  4. Does --basecalled_path refer to the fastq files converted from fast5 files?

Thank you !

liuqianhn commented 3 years ago

@MeiShu00 Thanks for being interested in DeepRepeat.

  1. You will get better performance if you train further on your own data, using my trained models as pre-trained models. If your data set is not large, a general-purpose GPU is enough for training. DeepRepeat is independent of the alignment, so you can use any aligner to generate BAM files. I usually use minimap2.
  2. To speed up the testing process, an index is built from a default 'sequence_summary.txt' located under the basecalling folder. If you have 2 barcodes but a single 'sequence_summary.txt', you can provide the same --f5folder path for the basecalling folder (this folder should contain 'sequence_summary.txt'), but provide different BAM files and different "--basecalled_path" values (such as --basecalled_path "workspace/pass/barcode01" and --basecalled_path "workspace/pass/barcode02"). You might also need to use "--f5i" for different index files (such as --f5i barcode1.f5index and --f5i barcode2.f5index). The starting point of the analysis is the BAM files.
  3. --is_pcr is for the peak calling step: for targeted sequencing with PCR, more supporting long reads are usually expected for larger repeat counts. If you check the repeat count distribution, this parameter will not change that distribution.
  4. --basecalled_path tells DeepRepeat where to find the fast5 files in the basecalling folder. Basecallers usually have a folder "workspace/pass" for fast5 files. If your basecaller uses a different path from "workspace/pass", you can use this parameter to change the default setting.
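
As an illustration of point 2, here is a minimal Python sketch that assembles one Detect command per barcode. It uses only the options named in this thread (--f5folder, --basecalled_path, --f5i). The folder /data/run1/basecalled and the index file names are hypothetical, and the per-barcode BAM file and any other required options are left out because their flag names are not given here, so treat this as a starting point rather than a complete command.

```python
import shlex


def detect_commands(f5folder, barcodes):
    """Build one `python DeepRepeat.py Detect` argument list per barcode.

    All barcodes share the same basecalling folder (the one holding the
    summary file), but each gets its own --basecalled_path and --f5i.
    """
    commands = {}
    for bc in barcodes:
        commands[bc] = [
            "python", "DeepRepeat.py", "Detect",
            "--f5folder", f5folder,
            "--basecalled_path", f"workspace/pass/{bc}",
            "--f5i", f"{bc}.f5index",
            # The per-barcode BAM file and other required options are
            # deliberately omitted: their flags are not named in this thread.
        ]
    return commands


# Hypothetical basecalling folder; replace with your own path.
cmds = detect_commands("/data/run1/basecalled", ["barcode01", "barcode02"])
for bc, argv in cmds.items():
    print(bc, "->", shlex.join(argv))
```

Each printed line can be completed with the BAM file and remaining options for that barcode before it is run.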
MeiShu00 commented 3 years ago

@liuqianhn

Thank you for your quick response! However, I still have the following questions:

  1. Regarding the warning that I see after entering python DeepRepeat.py Detect into the terminal: [image]
     Can this be ignored, or are the package versions clashing with one another? I installed all packages by following the "Install DeepRepeat via conda" instructions on the GitHub page.
  2. Can your trained models be used for data that are Guppy basecalled and aligned to the GRCh38 genome?
  3. If I choose to train my own model, would there be an issue if my data is Guppy basecalled? (Since training of the model requires basecalling via Albacore v2.3.)
  4. Regarding --f5i, I understand that if the index files do not exist, f5.f5index will be created. Does this mean that I do not need to generate them myself? If so, do I still need to pass this flag when running it in the terminal?

Thank you !

liuqianhn commented 3 years ago

@MeiShu00

  1. The warnings should be fine. They come from differences between tensorflow versions.
  2. Guppy generates a move table, and I am currently working on making DeepRepeat support it.
  3. The difference between Guppy and Albacore is that Guppy outputs a move table, while Albacore outputs an event table. As I mentioned in 2, I am working on making DeepRepeat flexible enough to handle both move and event tables.
  4. You generally do not need --f5i. But since you have different barcode folders, it is better to generate an index file for each folder (otherwise, the default path for fast5 files is directly under the workspace/pass/ folder).
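
To check which kind of table your own fast5 files carry (points 2 and 3), a small sketch like the one below may help. It assumes the commonly seen layout in which basecall results sit under a group named BaseCalled_template, with a dataset called Move for Guppy output or Events for Albacore output. That layout is an assumption about typical fast5 files, not something DeepRepeat guarantees, and table_kind is a hypothetical helper name.

```python
def table_kind(dataset_paths):
    """Classify a read by the basecall tables found in its fast5 paths.

    Only datasets directly under a BaseCalled_template group are considered,
    so Events written by other analyses are ignored.
    Returns "move", "event", "both", or "none".
    """
    found = set()
    for path in dataset_paths:
        parts = path.strip("/").split("/")
        if len(parts) >= 2 and parts[-2] == "BaseCalled_template":
            if parts[-1] in ("Move", "Events"):
                found.add(parts[-1])
    if found == {"Move", "Events"}:
        return "both"
    if "Move" in found:
        return "move"
    if "Events" in found:
        return "event"
    return "none"


# Illustrative path lists (assumed layout, not taken from a real run).
guppy_like = [
    "Analyses/Basecall_1D_000/BaseCalled_template/Fastq",
    "Analyses/Basecall_1D_000/BaseCalled_template/Move",
]
albacore_like = [
    "Analyses/Basecall_1D_000/BaseCalled_template/Fastq",
    "Analyses/Basecall_1D_000/BaseCalled_template/Events",
]
print(table_kind(guppy_like), table_kind(albacore_like))
```

The path list for a real file can be collected with h5py, for example by opening the file with h5py.File and passing a list's append method to its visit method.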
MeiShu00 commented 3 years ago

@liuqianhn

Regarding point 4 in the previous comment, what tools would be recommended for generating index files for fast5 files in each folder ?

Thank you !

liuqianhn commented 2 years ago

@MeiShu00 Sorry that I forgot to reply to this message. DeepRepeat/bin/scripts/IndexF5files can be used to build index files.