Hyb-seq data, can it be used with ipyrad? (Issue #554; closed by AngieOrtizHerrera 7 months ago)
I have Hyb-seq data and I would like to know if I can use ipyrad for SNP calling. If it's possible, with which datatype and restriction_overhang? I used a gene panel to get my reads...
Hello, I have never tried this myself, but it should work in theory. If you include the target sequences in fasta format as the reference_sequence and run it with the reference assembly method, it should work well. I would set the datatype to pairddrad. You can leave the restriction_overhang as default; it will be ignored in this case. Try running it this way and let me know how it goes!
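As a sketch, the relevant entries of the params file would look something like this (the fasta path is a placeholder for your own target sequence file):

reference                        ## [5] [assembly_method]: Assembly method (denovo, reference)
/path/to/target_sequences.fasta  ## [6] [reference_sequence]: Location of reference sequence file
pairddrad                        ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.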
Hi, thanks for your concern. I already tried with 7 paired samples following your directions, on a cluster, and I get this error:
Traceback (most recent call last):
File "/home/software/anaconda3/envs/ipyrad-0.9.92.2/bin/ipyrad", line 10, in
I also tried with other samples (GBS with two enzymes, EcoRI/MseI) with a reference genome, so I set pairddrad; is that OK?
Thanks!! Angie
@AngieOrtizHerrera Can you show me your params file? That error looks like you don't have permissions to create the project_dir. Make sure the project dir is located somewhere you have read/write access to.
pairddrad for 2-enzyme GBS is just fine.
Here we go. I have another project running (GBS); I have a "size" issue with it, but it is working fine, so I think the installation on the cluster and my general permissions are fine.
------- ipyrad params file (v.0.9.92)-------------------------------------------
Juncus_test1                                     ## [0] [assembly_name]: Assembly name. Used to name
                                                 ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
/home/maortiz/juncus/ipyrad/test1                ## [1] [project_dir]: Project dir (made in curdir if not present)
/home/maortiz/juncus/seq/*.fastq                 ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
reference                                        ## [5] [assembly_method]: Assembly method (denovo, reference)
/home/maortiz/juncus/ipyrad/test1/mega353.fasta  ## [6] [reference_sequence]: Location of reference sequence file
pairddrad                                        ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
5                                                ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                                               ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                                                ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                                                ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                                            ## [13] [maxdepth]: Max cluster depth within samples
0.9                                              ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                                                ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
0                                                ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
7                                                ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
4                                                ## [18] [max_alleles_consens]: Max alleles per site in consensus seq, 4 if tetraploid
0.05                                             ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus
0.05                                             ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus
4                                                ## [21] [min_samples_locus]: Min # samples per locus for output, start with 50% of samples
0.2                                              ## [22] [max_SNPs_locus]: Max # SNPs per locus
8                                                ## [23] [max_Indels_locus]: Max # of indels per locus
0.5                                              ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
0, 0, 0, 0                                       ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                                       ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
                                                 ## [28] [pop_assign_file]: Path to population assignment file
                                                 ## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3
It looks like the first few lines of the params file are out of order? That would certainly cause a problem. project_dir should be the 2nd parameter in the file (the third line, after the header and assembly_name).
------- ipyrad params file (v.0.9.92)-------------------------------------------
Juncus_test1 ## [0] [assembly_name]: Assembly name. Used to name
## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
## [3] [barcodes_path]: Location of barcodes file output dir for assembly steps
/home/maortiz/juncus/ipyrad/test1 ## [1] [project_dir]: Project dir (made in curdir if not present)
/home/maortiz/juncus/seq/*.fastq ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
It should look like this:
------- ipyrad params file (v.0.9.92)-------------------------------------------
Juncus_test1 ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/home/maortiz/juncus/ipyrad/test1 ## [1] [project_dir]: Project dir (made in curdir if not present)
## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
## [3] [barcodes_path]: Location of barcodes file
/home/maortiz/juncus/seq/*.fastq ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
Thanks! I rewrote the parameter file and it works. I attach the stats; I think it's not so bad. I did a test with only 7 samples.
Awesome, thanks for letting me know. Glad it worked! The stats file did not attach properly so I can't see it. If you want to email it to me I'll take a look. Closing this issue for now as it seems that it's working okay!
Great, I attach the stats. I only analysed seven samples as a test. I got around 300,000 mapped reads and 5 million unmapped reads per sample. Do you think I could get more variants with de novo? Does that make sense?
I have another issue, with the size of the files. I need to analyse 188 paired GBS samples. The script runs nicely but stops in step 3 because the terabyte I have assigned on my cluster gets full.
“Exception ignored in: <_io.TextIOWrapper name='
      clusters_total  filtered_by_depth  filtered_by_maxH  filtered_by_maxAlleles  filtered_by_maxN  reads_consens  nsites  nhetero  heterozygosity
JB11           37389              32467                38                       0                 0           4829  217405      316         0.00145
JB12           47369              42857                58                       0                 0           4364  224051     2127         0.00949
JB39           35810              31751                31                       0                 0           3978  180622      370         0.00205
JB43           29067              26438                18                       0                 0           2591  159544      269         0.00169
JB51           45055              40983                48                       0                 0           3949  197118     1837         0.00932
JB64           39665              34133                37                       0                 0           5426  238874      333         0.00139
JB84           45865              41996                51                       0                 0           3764  182148     1694         0.00930

      reads_raw  trim_adapter_bp_read1  trim_adapter_bp_read2  trim_quality_bp_read1  trim_quality_bp_read2  reads_filtered_by_Ns  reads_filtered_by_minlen  reads_passed_filter
JB11    6136389                      0                      0                      0                      0                   629                         0              6135760
JB12    7008750                      0                      0                      0                      0                   662                         0              7008088
JB39    4997245                      0                      0                      0                      0                   518                         0              4996727
JB43    4942486                      0                      0                      0                      0                   513                         0              4941973
JB51    5540755                      0                      0                      0                      0                   584                         0              5540171
JB64    6598989                      0                      0                      0                      0                   644                         0              6598345
JB84    4883331                      0                      0                      0                      0                   523                         0              4882808

      clusters_total  hidepth_min  clusters_hidepth  avg_depth_total  avg_depth_mj  avg_depth_stat  sd_depth_total  sd_depth_mj  sd_depth_stat
JB11           37389            6              6716             7.94         34.27           34.27           97.02       227.06         227.06
JB12           47369            6              7075             7.58         38.95           38.95          132.29       340.61         340.61
JB39           35810            6              5569             8.72         44.85           44.85          115.61       290.49         290.49
JB43           29067            6              3364             8.59         59.58           59.58          124.77       362.72         362.72
JB51           45055            6              6181             8.02         45.77           45.77          131.58       352.91         352.91
JB64           39665            6              7615             8.15         33.13           33.13          100.48       227.62         227.62
JB84           45865            6              6171             6.06         32.00           32.00           85.81       232.26         232.26

      hetero_est  error_est
JB11    0.003562   0.000566
JB12    0.011514   0.000919
JB39    0.004313   0.000397
JB43    0.003997   0.000636
JB51    0.011209   0.000673
JB64    0.003594   0.000512
JB84    0.011605   0.000799
                            total_filters  applied_order  retained_loci
total_prefiltered_loci                  0              0          12029
filtered_by_rm_duplicates               0              0          12029
filtered_by_max_indels                  0              0          12029
filtered_by_max_SNPs                    0              0          12029
filtered_by_max_shared_het             54             54          11975
filtered_by_min_sample               9286           9286           2689
total_filtered_loci                  9340           9340           2689
           sample_coverage
reference             2689
JB11                  2419
JB12                  1931
JB39                  2305
JB43                  1909
JB51                  1949
JB64                  2468
JB84                  1906
   locus_coverage  sum_coverage
1               0             0
2               0             0
3               0             0
4             841           841
5             466          1307
6             481          1788
7             901          2689
8               0          2689
The distribution of SNPs (var and pis) per locus.
     var  sum_var   pis  sum_pis
0   2101        0  2344        0
1    284      284   152      152
2    102      488    60      272
3     48      632    30      362
4     34      768    13      414
5     16      848    17      499
6     14      932    11      565
7      5      967     8      621
8     10     1047     6      669
9      4     1083     7      732
10     2     1103     6      792
11     2     1125     2      814
12     2     1149     3      850
13     4     1201     2      876
14     3     1243     4      932
15     7     1348     2      962
16     4     1412     0      962
17     7     1531     2      996
18     1     1549     0      996
19     4     1625     2     1034
20     0     1625     2     1074
21     0     1625     2     1116
22     4     1713     1     1138
23     1     1736     2     1184
24     3     1808     0     1184
25     2     1858     2     1234
26     1     1884     1     1260
27     3     1965     1     1287
28     3     2049     1     1315
29     4     2165     2     1373
30     0     2165     1     1403
31     1     2196     0     1403
32     0     2196     1     1435
33     2     2262     0     1435
34     3     2364     1     1469
35     0     2364     0     1469
36     1     2400     0     1469
37     0     2400     0     1469
38     1     2438     0     1469
39     0     2438     0     1469
40     2     2518     0     1469
41     2     2600     1     1510
42     0     2600     0     1510
43     0     2600     0     1510
44     1     2644     0     1510
45     0     2644     0     1510
46     0     2644     0     1510
47     0     2644     0     1510
48     0     2644     0     1510
49     1     2693     0     1510
      state  reads_raw  reads_passed_filter  refseq_mapped_reads  refseq_unmapped_reads  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens  loci_in_assembly
JB11      7    6136389              6135760               296799                5838961           37389              6716    0.003562   0.000566           4829              2419
JB12      7    7008750              7008088               358840                6649248           47369              7075    0.011514   0.000919           4364              1931
JB39      7    4997245              4996727               312161                4684566           35810              5569    0.004313   0.000397           3978              2305
JB43      7    4942486              4941973               249704                4692269           29067              3364    0.003997   0.000636           2591              1909
JB51      7    5540755              5540171               361458                5178713           45055              6181    0.011209   0.000673           3949              1949
JB64      7    6598989              6598345               323260                6275085           39665              7615    0.003594   0.000512           5426              2468
JB84      7    4883331              4882808               278010                4604798           45865              6171    0.011605   0.000799           3764              1906
snps matrix size: (8, 2693), 23.64% missing sites.
sequence matrix size: (8, 130760), 23.77% missing sites.
The stats look reasonable-ish. It's interesting that so many of the reads are unmapped; not sure why that would be, perhaps off-target reads? It might be worth trying de novo, just to see.
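If you do try it, branching avoids redoing steps 1-2. A minimal sketch (the branch name Juncus_denovo here is arbitrary):

# create a new branch that reuses the demultiplexing/filtering results
ipyrad -p params-Juncus_test1.txt -b Juncus_denovo
# edit params-Juncus_denovo.txt: set parameter [5] assembly_method to denovo
# then re-run clustering through output on the branch
ipyrad -p params-Juncus_denovo.txt -s 34567 -c 20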
In terms of disk space, if you are running out of quota there's not much you can do besides try to increase your allocation. Removing temporary folders after step 2 could cause problems, and it won't solve the longer term problem because step 3 generates lots of temp files as well (also steps 5-7 generate more temp files).
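If it helps to see where the space is actually going, standard coreutils will do it (the project path is a placeholder):

# per-subdirectory totals inside the project dir, largest last
du -sh /path/to/project_dir/* | sort -h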
300 million reads for a sample is a LOT. It might work to downsample the number of reads for the really huge samples. You could do something like this:
zcat big_sample_R1_.fastq.gz | head -n 32000000 > subsample_R1_.fastq
gzip subsample_R1_.fastq
It would be a headache to do this for many samples, but it could be done. Here you just take the first 32 million lines of the fastq (equal to the first 8 million reads). I've done this before and it works fine. 8 million reads is still a lot.
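A minimal sketch of how this could be scripted over many paired samples, assuming the raw files sit in raw/ and are named like <sample>_R1_.fastq.gz and <sample>_R2_.fastq.gz (cutting both mates at the same line count keeps the pairs in sync):

mkdir -p subsampled
for r1 in raw/*_R1_.fastq.gz; do
    r2=${r1/_R1_/_R2_}    # the matching R2 file
    # keep the first 8 million reads (32M lines) of each mate
    zcat "$r1" | head -n 32000000 | gzip > "subsampled/$(basename "$r1")"
    zcat "$r2" | head -n 32000000 | gzip > "subsampled/$(basename "$r2")"
done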
Good luck, and let me know how it goes, -isaac
Thanks!! About Hyb-seq: the unmapped ones are because the technique is hybrid: we did the gene target capture and also a shotgun, so we get a soup of different things (plastid, etc.). I will try de novo and will write you again, though not soon, because I am struggling with a deadline for the GBS data.
Thanks for the tip! I am sure it will be enough. Unfortunately I can't get an increase in space. Maybe I can run it on another cluster, but I am still not sure; in any case not before summer, I'm afraid.
Thank you for your kindness,
Dear Isaac, I attach the de novo stats results of my 7-sample Hyb-seq analysis. What do you think? It takes one week running with 20 cores. I need to do it with around 300 samples; which method do you recommend, the target reference or de novo? Thanks!!!
                            total_filters  applied_order  retained_loci
total_prefiltered_loci                  0              0          36831
filtered_by_rm_duplicates           11538          11538          25293
filtered_by_max_indels                313            313          24980
filtered_by_max_SNPs                  208             35          24945
filtered_by_max_shared_het            122            106          24839
filtered_by_min_sample              22576          22576           2263
total_filtered_loci                 34757          34568           2263
      sample_coverage
JB11             1666
JB12              964
JB39             1748
JB43             1115
JB51             1358
JB64             1786
JB84             1176
   locus_coverage  sum_coverage
1               0             0
2               0             0
3               0             0
4            1654          1654
5             471          2125
6             124          2249
7              14          2263
The distribution of SNPs (var and pis) per locus.
     var  sum_var   pis  sum_pis
0    699        0  1361        0
1    308      308   271      271
2    231      770   168      607
3    192     1346   124      979
4    183     2078   111     1423
5    146     2808    60     1723
6    106     3444    44     1987
7    105     4179    40     2267
8     56     4627    31     2515
9     63     5194    18     2677
10    51     5704    15     2827
11    23     5957     7     2904
12    25     6257     8     3000
13    15     6452     1     3013
14    16     6676     3     3055
15     6     6766     0     3055
16    11     6942     1     3071
17     9     7095     0     3071
18     4     7167     0     3071
19     2     7205     0     3071
20     2     7245     0     3071
21     2     7287     0     3071
22     1     7309     0     3071
23     0     7309     0     3071
24     1     7333     0     3071
25     2     7383     0     3071
26     1     7409     0     3071
27     0     7409     0     3071
28     0     7409     0     3071
29     0     7409     0     3071
30     0     7409     0     3071
31     1     7440     0     3071
32     1     7472     0     3071
33     0     7472     0     3071
34     0     7472     0     3071
35     1     7507     0     3071
      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens  loci_in_assembly
JB11      7    6136389              6135760         1535776             66822    0.034006   0.008059          43685              1666
JB12      7    7008750              7008088         2192356             60562    0.048400   0.012410          29346               964
JB39      7    4997245              4996727         1258484             53668    0.035666   0.006538          35375              1748
JB43      7    4942486              4941973         1268552             27642    0.035989   0.006607          16535              1115
JB51      7    5540755              5540171         1752318             44729    0.048503   0.010135          23510              1358
JB64      7    6598989              6598345         1692910             78747    0.032837   0.007635          52564              1786
JB84      7    4883331              4882808         1726516             43727    0.048839   0.012985          21131              1176
snps matrix size: (7, 7507), 41.19% missing sites.
sequence matrix size: (7, 239313), 39.72% missing sites.
      hetero_est  error_est
JB11    0.034006   0.008059
JB12    0.048400   0.012410
JB39    0.035666   0.006538
JB43    0.035989   0.006607
JB51    0.048503   0.010135
JB64    0.032837   0.007635
JB84    0.048839   0.012985

      clusters_total  hidepth_min  clusters_hidepth  avg_depth_total  avg_depth_mj  avg_depth_stat  sd_depth_total  sd_depth_mj  sd_depth_stat
JB11         1535776            6             66822             2.11         11.33           11.33            4.47        18.66          18.66
JB12         2192356            6             60562             1.80         10.05           10.05            2.76        13.29          13.29
JB39         1258484            6             53668             2.25         16.57           16.57            8.01        35.62          35.62
JB43         1268552            6             27642             1.75         12.69           12.69            3.51        20.28          20.28
JB51         1752318            6             44729             1.84         15.45           15.45            4.86        26.62          26.62
JB64         1692910            6             78747             2.16         11.77           11.77            5.04        20.70          20.70
JB84         1726516            6             43727             1.79         12.79           12.79            3.64        19.25          19.25

      clusters_total  filtered_by_depth  filtered_by_maxH  filtered_by_maxAlleles  filtered_by_maxN  reads_consens   nsites  nhetero  heterozygosity
JB11         1535776            1469874              4670                      35             17264          43685  6317952    31212         0.00494
JB12         2192356            2132606              5587                      19             24614          29346  4457580    51201         0.01149
JB39         1258484            1205321              3717                      30             13909          35375  5347445    23725         0.00444
JB43         1268552            1241193              1573                      13              9158          16535  2443912    10255         0.00420
JB51         1752318            1707928              3878                      38             16892          23510  3757638    26308         0.00700
JB64         1692910            1615247              5675                      41             19112          52564  7770540    36621         0.00471
JB84         1726516            1683237              4087                      36             17935          21131  3275948    28162         0.00860
The results are pretty similar for de novo vs reference-based. It seems the de novo method isn't picking up much in terms of new loci (which makes sense if the off-target is shotgun). I would probably go for the reference method using the bait sequences as the pseudo-reference, as you have done. Plus it'll run faster in reference mode with the 300 samples. Let me know how it goes.
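For the full 300-sample run, the invocation would be the usual one, just pointed at the new params file and however many cores you have (names here are placeholders):

ipyrad -p params-Juncus_full.txt -s 1234567 -c 20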