Subset small test dataset

kelly-sovacool commented 1 year ago

Subset the data by keeping all reads that aligned to a single chromosome. This is better than random sampling because read depth stays high in the region that is kept.

In progress on branch tests_iss-27

kelly-sovacool commented 1 year ago

I picked bam files for chromosome 22 from an example run (/data/CCBR/projects/techDev/runs/gui/hg38_pair-y_cnv-y_ffpe-y/bams/chrom_split), used samtools to convert them to fastq, and gzipped the result. However, XAVIER expects paired-end input fastq files, and with this method both reads of each pair end up combined in a single file. How can I make faux read pairs from these chr22 fastq files?

Solution: I realized all the headers end in /1 or /2 to designate the forward and reverse reads, so I can split the file into two based on the fastq headers. https://www.biostars.org/p/141256/
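The header-based split described above can be sketched in Python as follows. This is only an illustration, not the command that was actually run: the function name `split_by_mate` and the file paths are hypothetical, and it assumes every record is exactly four lines with a header line that ends in /1 or /2.

```python
import gzip
from itertools import islice


def split_by_mate(combined_path, r1_path, r2_path):
    """Write records whose header ends in /1 to r1_path and /2 to r2_path.

    Returns the number of records written to each file as (n_r1, n_r2).
    """
    counts = {"/1": 0, "/2": 0}
    with gzip.open(combined_path, "rt") as src, \
            gzip.open(r1_path, "wt") as r1, \
            gzip.open(r2_path, "wt") as r2:
        outputs = {"/1": r1, "/2": r2}
        while True:
            # A FASTQ record is four lines: header, sequence, '+', qualities.
            record = list(islice(src, 4))
            if not record:
                break
            if len(record) != 4:
                raise ValueError("truncated FASTQ record at end of file")
            header = record[0].rstrip("\n")
            suffix = header[-2:]
            if suffix not in outputs:
                raise ValueError(f"header lacks a /1 or /2 suffix: {header!r}")
            outputs[suffix].writelines(record)
            counts[suffix] += 1
    return counts["/1"], counts["/2"]
```

Comparing the two returned counts is a cheap sanity check. The sketch does not reorder reads or drop orphans, so unequal counts would mean some mates are missing and the two files are not properly paired.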

kelly-sovacool commented 1 year ago

Now running into a weird symlinking issue on biowulf.

dryrun

/data/CCBR/projects/techDev/XAVIER/bin/xavier run --runmode dryrun --input /data/CCBR/projects/techDev/test_xavier/data/fastqs_deinterleaved/*.fastq.gz --output /data/CCBR/projects/techDev/test_xavier/results/hg38_pair-n_cnv-n_ffpe-n --genome hg38 --targets /data/CCBR/projects/techDev/XAVIER/resources/Agilent_SSv7_allExons_hg38.bed

output

xavier
[-] Unloading samtools 1.17  ...
[-] Unloading snakemake  7.19.1
[+] Loading singularity  3.10.5  on cn4270
[+] Loading snakemake  7.19.1
xavier (v1.1)
Traceback (most recent call last):
  File "/vf/users/CCBR/projects/techDev/XAVIER/xavier", line 731, in <module>
    main()
  File "/vf/users/CCBR/projects/techDev/XAVIER/xavier", line 727, in main
    args.func(args)
  File "/vf/users/CCBR/projects/techDev/XAVIER/xavier", line 96, in run
    config = setup(sub_args,
             ^^^^^^^^^^^^^^^
  File "/vf/users/CCBR/projects/techDev/XAVIER/src/run.py", line 174, in setup
    ifiles = sym_safe(input_data = links, target = output_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vf/users/CCBR/projects/techDev/XAVIER/src/run.py", line 97, in sym_safe
    os.symlink(os.path.abspath(os.path.realpath(file)), renamed)
FileNotFoundError: [Errno 2] No such file or directory: '/vf/users/CCBR/projects/techDev/test_xavier/data/fastqs_deinterleaved/sample1-normal.chr22.split.R1.fastq.gz' -> '/vf/users/CCBR/projects/techDev/test_xavier/hg38_pair-n_cnv-n_ffpe-n/sample1-normal.chr22.split.R1.fastq.gz'

I also get this error from the GUI.

Solution: need to run init before dryrun.
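One plausible reading of the traceback, assuming init is the step that creates the output directory (not verified against the XAVIER source): `os.symlink` raises `FileNotFoundError` when the directory that should hold the new link does not exist, even if the source file is present. The sketch below reproduces that behaviour in a temporary directory. `link_input` is a hypothetical helper that only mirrors the `os.symlink` call shown in the traceback.

```python
import os
import tempfile


def link_input(source, output_dir):
    """Symlink source into output_dir, mirroring the call in the traceback."""
    destination = os.path.join(output_dir, os.path.basename(source))
    os.symlink(os.path.abspath(os.path.realpath(source)), destination)
    return destination


with tempfile.TemporaryDirectory() as workdir:
    fastq = os.path.join(workdir, "sample.R1.fastq.gz")
    open(fastq, "w").close()  # the input file exists
    output_dir = os.path.join(workdir, "results")  # not created yet

    try:
        link_input(fastq, output_dir)
    except FileNotFoundError as err:
        # Fails because the parent directory of the link is missing.
        print("before creating the output directory:", err.strerror)

    os.makedirs(output_dir)  # stand-in for whatever init sets up (assumption)
    print("link created:", os.path.islink(link_input(fastq, output_dir)))
```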

kelly-sovacool commented 1 year ago

Selected raw reads that mapped to a small region of chromosome 22. Now testing on biowulf.

https://github.com/CCBR/XAVIER/tree/9fcd76bb9474ee76c919c34bf8a5a99925bae864/tests

kelly-sovacool commented 1 year ago

The regions in the test dataset need enough coverage to make it through the somalier analysis: https://github.com/brentp/somalier/issues/50

Solution: if the data covers fewer than, e.g., 20 chromosomes, just touch the somalier output file instead of running somalier.
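A minimal sketch of that guard, under stated assumptions: the chromosome count is taken from the first column of a BED file (the thread does not say where the count should come from), the threshold is the "e.g. 20" mentioned above, and `count_chromosomes`, `somalier_or_placeholder`, and the `run_somalier` callable are hypothetical names rather than part of XAVIER.

```python
from pathlib import Path

MIN_CHROMOSOMES = 20  # illustrative threshold taken from the comment above


def count_chromosomes(bed_path):
    """Count distinct chromosome names in the first column of a BED file."""
    chroms = set()
    with open(bed_path) as bed:
        for line in bed:
            # Skip blank lines and BED header/comment lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chroms.add(line.split()[0])
    return len(chroms)


def somalier_or_placeholder(bed_path, output_path, run_somalier):
    """Run somalier only when enough chromosomes are covered.

    Otherwise create an empty output file so downstream steps still find it.
    Returns True if run_somalier was called, False if the file was touched.
    """
    if count_chromosomes(bed_path) < MIN_CHROMOSOMES:
        Path(output_path).touch()
        return False
    run_somalier()
    return True
```

Touching the file keeps the workflow's expected outputs in place for small test data, at the cost of producing an empty result that downstream steps must tolerate.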

kelly-sovacool commented 1 year ago

Currently this test dataset works with the paired and cnv options off, but fails otherwise. It will need further refinement to figure out why.

kelly-sovacool commented 3 months ago

The new subsampled dataset in tests/data/ will fail with --cnv and on somalier, but there's now a larger 25% subset available on biowulf that works for these steps: /data/CCBR_Pipeliner/testdata/XAVIER/human_subset. This should be good enough for our purposes.

CCBR / XAVIER

Subset small test dataset #27