RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License
55 stars 8 forks source link

Please provide an example for the creation of a cfg file from a fasta containing multiple sequences with AriocE #27

Closed claus-h-g closed 1 year ago

claus-h-g commented 1 year ago

many thanks for providing and supporting Arioc. So far I do fail to create a cfg file with AriocE from a fasta containing multiple sequences. I failed to derive the correct "parenthesized wildcard specification". I can not overcome this error: unable to open or create directory ./enc/ssi84_2_30 (errno=2: No such file or directory).

I would highly appreciate an example code.

RWilton commented 1 year ago

Please start with the examples in the Arioc distribution. The S. cerevesiae sample is surely the easiest to tackle first.

If you cannot make that work, feel free to post the configuration file(s) you are using as well as Arioc's output log.

Good luck.

claus-h-g commented 1 year ago

Many thanks for your kind reply. I appreciate English is not my mother tongue, but I did follow the examples in the readme file in Arioc.RQA.150.zip archive. For my understanding there is no example with one fasta containing multiple sequences. I did get the code for S. cerevesiae sample data to work. Ultimately we want to use Airoc in on WGBS data. In the current proof of concept, the anticipated CPG information has to match similar information obtained with Illumina EPIC or 850 k array. Hence I am constrained to hg19. Any liftover approach I have used so far was problemeatic in one way or the other. I would highly appreciate not entering any discussion on the use of hg19. I do work in a clinical setting and have to stick to what is accepted here and is likely to pass any accreditation.

I do work with hg.19.fa obtained with this link: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

I found so far one error in my approach. After fixing this I still do not get the results - I think I should obtain. But maybe I am expecting the wrong results. hg19.fa is in the subdirectory raw on level up I did created for my understanding AriocE.nongapped_hg_19_fa.cfg for the first two chromosomes chr1 and chr2:

$ cat AriocE.nongapped_hg_19_fa.cfg <?xml version="1.0" encoding="utf-8"?>

hg19.fa hg19.fa ./enc

I did run

/applications/Airoc/R/hg19$ ../../bin/$ AriocE AriocE.nongapped_hg_19_fa.cfg

162235023 [00005e46] AriocE v1.50.3065.22041 (release) +2022-02-10T19:56:42 162235026 [00005e46] Copyright (c) 2015-2022 Johns Hopkins University. All rights reserved. 162235026 [00005e46] compiled : 2022-07-27T16:47:14 (GNU g++ v7.5.0) 162235026 [00005e46] data type sizes : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian) 162235026 [00005e46] computer name : meqneuropat24 162235026 [00005e46] operating system : Ubuntu 18.04.6 LTS 162235026 [00005e46] executable file : /applications/Airoc/bin/AriocE 162235026 [00005e46] configuration file : /applications/Airoc/R/hg19/AriocE.nongapped_hg_19_fa.cfg 162236975 [00005e50] ApplicationException ([0x00024144] AriocE/tuSniffFASTA.cpp 158): FASTA file contains 93 reference sequences: ./raw/hg19.fa 162236976 [00005e4f] ApplicationException ([0x00024143] AriocCommon/../CppCommon/RaiiCriticalSection.h 179): pthread_mutex_lock returned 22

with out any error message. But in the enc subdirectory I did obtain

/applications/Airoc/R/hg19$ ls enc 'hg19$a21.sbf' 'hg19$raw.sbf' 'hg19$sqm.sbf' ssi84_2_30

According to your example I did expect a $a21.rc.sbf $a21.sbf raw.rc.sbf $sqm.rc.sbf $a21.sbf $raw.sbf $sqm.sbf for each of both selected cromosomes.

I am wondering is my cfg or my expectation wrong?

RWilton commented 1 year ago

I think what's missing is the fundamental notion that AriocE expects each .cfg description of a FASTA file to correspond to its actual contents. For example, if hg19.fa contains 93 sequences, you need a line in the .cfg that says, in effect, "file hg19.fa contains multiple reference sequences whose unique IDs are to be parsed from their FASTA definition lines according to a specified pattern."

Unfortunately, Arioc places a hard limit of 1G (one gigabase) on the size of an individual reference sequence file. This will prevent you from using a single FASTA file that simply aggregates all of the assembled human chromosomes as well as 65 or 70 additional sequences (alternate reference sequences, decoy sequences, or whatever). Instead, for a reference genome the size of GRCh37, you should place each of the assembled chromosomes (chr1-chr22, chrX, chrY, chrMT) in separate files, and aggregate the remaining short sequences in one or more additional files.

Here is an example you might use as a template for your work:

<!-- AriocE.gapped.cfg

    FASTA deflines look like this:
      >acc|GENBANK|MH171300.1|linear|Marine RNA virus BC-4 non-structural polyprotein and structural polyprotein genes, complete cds.|Marine RNA virus BC-4|ENV|28-SEP-2018
-->
<AriocE seed="hsi20_0_31" maxJ="200" verboseMask="0xE000000f">

    <dataIn sequenceType="R"
            filePath="./raw"
            srcId="1"
            AS="GRCh38"
            uriPath="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/">
        <file SN="chr1" subId="1">chr1.fna</file>
        <file SN="chr2" subId="2">chr2.fna</file>
        <file SN="chr3" subId="3">chr3.fna</file>
        <file SN="chr4" subId="4">chr4.fna</file>
        <file SN="chr5" subId="5">chr5.fna</file>
        <file SN="chr6" subId="6">chr6.fna</file>
        <file SN="chr7" subId="7">chr7.fna</file>
        <file SN="chr8" subId="8">chr8.fna</file>
        <file SN="chr9" subId="9">chr9.fna</file>
        <file SN="chr10" subId="10">chr10.fna</file>
        <file SN="chr11" subId="11">chr11.fna</file>
        <file SN="chr12" subId="12">chr12.fna</file>
        <file SN="chr13" subId="13">chr13.fna</file>
        <file SN="chr14" subId="14">chr14.fna</file>
        <file SN="chr15" subId="15">chr15.fna</file>
        <file SN="chr16" subId="16">chr16.fna</file>
        <file SN="chr17" subId="17">chr17.fna</file>
        <file SN="chr18" subId="18">chr18.fna</file>
        <file SN="chr19" subId="19">chr19.fna</file>
        <file SN="chr20" subId="20">chr20.fna</file>
        <file SN="chr21" subId="21">chr21.fna</file>
        <file SN="chr22" subId="22">chr22.fna</file>
        <file SN="chrX" subId="23">chrX.fna</file>
        <file SN="chrY" subId="24">chrY.fna</file>
        <file SN="chrMT" subId="25" TP="circular" uriPath="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_assembly_structure/non-nuclear/assembled_chromosomes/FASTA/">chrMT.fna</file>
        <file SN="*|*|(*)|" TP="*|*|*|(*)|" subId="101" uriPath="https://rvdb.dbi.udel.edu/">U-RVDBv21.0.TP.1.fasta</file>
        <file SN="*|*|(*)|" TP="*|*|*|(*)|" subId="102" uriPath="https://rvdb.dbi.udel.edu/">U-RVDBv21.0.TP.2.fasta</file>
        <file SN="*|*|(*)|" TP="*|*|*|(*)|" subId="103" uriPath="https://rvdb.dbi.udel.edu/">U-RVDBv21.0.TP.3.fasta</file>
        <file SN="*|*|(*)|" TP="*|*|*|(*)|" subId="104" uriPath="https://rvdb.dbi.udel.edu/">U-RVDBv21.0.TP.4.fasta</file>
        <file SN="*|*|(*)|" TP="*|*|*|(*)|" subId="105" uriPath="https://rvdb.dbi.udel.edu/">U-RVDBv21.0.TP.5.fasta</file>
        <file SN="*|*|(*)|" TP="*|*|*|(*)|" subId="106" uriPath="https://rvdb.dbi.udel.edu/">U-RVDBv21.0.TP.6.fasta</file>
        <file SN="*|*|(*)|" TP="*|*|*|(*)|" subId="107" uriPath="https://rvdb.dbi.udel.edu/">U-RVDBv21.0.TP.7.fasta</file>
        <file SN="*|*|(*)|" TP="*|*|*|(*)|" subId="108" uriPath="https://rvdb.dbi.udel.edu/">U-RVDBv21.0.TP.8.fasta</file>
    </dataIn>

    <dataOut>
        <path>./enc</path>
    </dataOut>

</AriocE>

Notice how the sequence identifier SN and the sequence topology TP are specified as the third and fourth fields in each defline in the U-RVDB* files. (There is more about how this works in the Arioc User Guide.)

I hope this helps!